

Poster

3DReflecNet: A Large-Scale Dataset for 3D Reconstruction of Reflective, Transparent, and Low-Texture Objects

Zhicheng Liang ⋅ Haoyi Yu ⋅ Boyan Li ⋅ Dayou Zhang ⋅ Zijian Cao ⋅ Tianyi Gong ⋅ Junhua Liu ⋅ Shuguang Cui ⋅ Fangxin Wang
Jun 5, 4:00 PM - 6:00 PM ExHall A & F
Accurate 3D reconstruction of objects with reflective, transparent, or low-texture surfaces remains a significant challenge. Such materials often violate key assumptions in multi-view reconstruction pipelines, such as photometric consistency and the reliance on distinct geometric texture cues. Existing datasets primarily focus on diffuse, textured objects, thereby offering limited insight into performance under real-world material complexities. In this paper, we introduce 3DReflecNet, a large-scale hybrid dataset exceeding 22 TB that is specifically designed to benchmark and advance 3D vision methods for these challenging materials. 3DReflecNet combines two types of data: over 100,000 synthetic instances generated via physically-based rendering of more than 10,000 shapes, and over 1,000 real-world objects scanned using consumer RGB-D devices. Together, these data comprise more than 7 million multi-view frames. The dataset encompasses diverse materials, complex lighting conditions, and a wide range of geometric forms, including shapes generated from both real and LLM-synthesized 2D images using diffusion-based methods. To support robust evaluation, we design benchmarks for four core tasks: image matching, reflection removal, structure-from-motion, and novel view synthesis. Through extensive experiments, we show that state-of-the-art methods struggle to maintain accuracy across these settings, highlighting the need for more resilient 3D vision models. We release the dataset, baselines, and evaluation suite to facilitate progress in this direction; they are available in the supplementary materials.
Poster

Adversarial Style Optimization: Enhancing VLM Jailbreaks by GRPO-based Stylistic Triggers Optimization

Bingjun Luo ⋅ Jialin Guo ⋅ Yue Yao ⋅ Xinpeng Ding
Jun 5, 10:45 AM - 12:45 PM ExHall A-F
Multimodal Large Language Models (MLLMs) have achieved impressive performance, but their safety alignment remains vulnerable to jailbreak attacks. Existing content-based jailbreaks are often inconsistent and show low attack success rates (ASR) against commercial closed-source MLLMs, failing to exploit non-content-based vulnerabilities. Unlike previous research, we empirically find that MLLMs exhibit a Stylistic Inconsistency between their comprehension ability and safety ability. That is, from the perspective of comprehension, MLLMs can robustly understand content regardless of visual style (e.g., "pencil sketch"). However, from the perspective of safety ability, their defense mechanisms can be easily bypassed by these specific stylistic triggers, leading to harmful responses. Based on this finding, we propose Adversarial Style Optimization (ASO), a plug-and-play enhancement module to amplify existing visual jailbreaks. ASO fine-tunes an image-editing model to superimpose an optimized stylistic modification onto a given adversarial image. We apply a Group Relative Policy Optimization (GRPO) agent, guided by a Structurally-Tiered Reward Function. This function uniquely combines a logit-based signal for detecting explicit refusals with a high-fidelity semantic evaluation from a powerful judge model, mapping outcomes to distinct, non-overlapping reward tiers to select the most potent stylistic parameters. Extensive experiments show that ASO significantly enhances the ASR of SOTA attacks. The GRPO agent automatically discovers optimal, non-intuitive parameters, demonstrating that stylistic biases are a scalable and modular vector for red-teaming MLLMs.
Poster

A Style is Worth One Code: Unlocking Code-to-Style Image Generation with Discrete Style Space

Huijie Liu ⋅ Shuhao Cui ⋅ Haoxiang Cao ⋅ Shuai Ma ⋅ Kai Wu ⋅ Guoliang Kang
Jun 5, 10:45 AM - 12:45 PM ExHall A-F
Innovative visual stylization is a cornerstone of artistic creation, yet generating novel and consistent visual styles remains a significant challenge. Existing generative approaches typically rely on lengthy textual prompts, reference images, or parameter-efficient fine-tuning to guide style-aware image generation, but often struggle with style consistency, limited creativity, and complex style representations. In this paper, we consider the code-to-style image generation task, which aims to produce images with novel and consistent visual styles specified by only a numerical code. To date, this field has been explored primarily by industry (e.g., Midjourney), with no open-source research from the academic community. To fill this gap, we propose CoTyle, the first open-source method for this task. Specifically, we first train a discrete style codebook from a collection of images to extract style embeddings. These embeddings serve as conditions for a text-to-image diffusion model (T2I-DM) to generate stylistic images. Subsequently, we train an autoregressive style generator on the discrete style embeddings to model their distribution, allowing the synthesis of novel style embeddings. During inference, a numerical style code is mapped to a unique style embedding by the style generator, and this embedding guides the T2I-DM to generate images in the corresponding style. Extensive experiments validate that CoTyle effectively converts a numerical code into a style controller, demonstrating that a style is worth one code. Compared to existing methods, the stylized images generated by our method are more diverse and consistent, unlocking a vast space of reproducible styles from minimal input.
Poster

Black-box Membership Inference Attacks on the Pre-training Data of Image-generation Models

Tao Qi ⋅ Huili Wang ⋅ Yuanhong Huang ⋅ Wendan Wang ⋅ Lianchao Zhao ⋅ Jinrui Wang ⋅ Zichen Qin ⋅ Shangguang Wang ⋅ Yongfeng Huang
Jun 5, 10:45 AM - 12:45 PM ExHall A-F
The rapid advancement of diffusion-based image generation models has raised serious concerns regarding potential copyright and privacy infringements involving human-created data. Membership inference attacks (MIAs) have emerged as a promising tool for identifying unauthorized data usage during model training. Existing methods typically assess the ability of a model to denoise perturbed suspect images as an indicator of membership status. However, the discriminative power of such features is highly dependent on the degree of model memorization and deteriorates significantly when applied to less exposed data (e.g., pre-training data). Although several methods attempt to enhance detection by leveraging internal model features, these features are generally inaccessible in mainstream closed-source image generation platforms, limiting their practicality. In this paper, we demonstrate that analyzing how a black-box diffusion model denoises a target image and corresponding perturbed textual instructions can reveal more distinctive membership cues. Based on this insight, we propose a black-box membership inference attack framework (named SD-MIA) that leverages a cross-modal data perturbation mechanism to detect pre-training data in diffusion models. We conduct extensive experiments on both a public benchmark dataset and a newly constructed dataset, each comprising pre-training membership and non-membership samples with identical distributions. Experimental results demonstrate that SD-MIA achieves superior performance compared to existing baselines, including those with the unfair advantage of accessing internal model features.
Poster

ChordEdit: One-Step Low-Energy Transport for Image Editing

Liangsi Lu ⋅ Xuhang Chen ⋅ Minzhe Guo ⋅ Shichu Li ⋅ Jingchao Wang ⋅ Yang Shi
Jun 6, 11:45 AM - 1:45 PM ExHall F
The advent of one-step text-to-image (T2I) models offers unprecedented synthesis speed. However, their application to text-guided image editing remains severely hampered, as forcing existing training-free editors into a single inference step fails. This failure manifests as severe object distortion and a critical loss of consistency in non-edited regions, resulting from the high-energy, erratic trajectories produced by naive vector arithmetic on the models' structured fields. To address this problem, we introduce ChordEdit, a model-agnostic, training-free, and inversion-free method that facilitates high-fidelity one-step editing. We recast editing as a transport problem between the source and target distributions defined by the source and target text prompts. Leveraging dynamic optimal transport theory, we derive a principled, low-energy control strategy. This strategy yields a smoothed, variance-reduced editing field that is inherently stable, allowing the field to be traversed in a single, large integration step. A theoretically grounded and experimentally validated approach allows ChordEdit to deliver fast, lightweight, and precise edits, finally achieving true real-time editing on these challenging models.
Poster

Chorus: Multi-Teacher Pretraining for Holistic 3D Gaussian Scene Encoding

Yue Li ⋅ Qi Ma ⋅ Runyi Yang ⋅ Mengjiao Ma ⋅ Bin Ren ⋅ Nikola Popovic ⋅ Nicu Sebe ⋅ Theo Gevers ⋅ Luc Van Gool ⋅ Danda Paudel ⋅ Martin R. Oswald
Jun 6, 4:45 PM - 6:45 PM ExHall A & F
While 3DGS has emerged as a high-fidelity scene representation, encoding rich, general-purpose features directly from its primitives remains under-explored. We address this gap by introducing Chorus, a multi-teacher pretraining framework that learns a holistic feed-forward 3D Gaussian Splatting (3DGS) scene encoder by distilling complementary signals from 2D foundation models. Chorus employs a shared 3D encoder and teacher-specific projectors to learn from language-aligned, generalist, and object-aware teachers, encouraging a shared embedding space that captures signals from high-level semantics to fine-grained structure. We evaluate Chorus on a wide range of tasks: open-vocabulary semantic and instance segmentation, linear and decoder probing, as well as data-efficient supervision. Besides 3DGS, we also test Chorus on several benchmarks that only support point clouds by pretraining a variant using only the Gaussians' centers, colors, and estimated normals as inputs. Interestingly, this encoder shows strong transfer and outperforms the point-cloud baseline while using 39.9× fewer training scenes. Finally, we propose a render-and-distill adaptation that facilitates out-of-domain finetuning. Our code and model will be released upon publication.
Poster

CineBrain: A Large-Scale Multi-Modal Audiovisual Brain Dataset for Brain-Conditioned Video Generation

Jianxiong Gao ⋅ Yichang Liu ⋅ baofeng yang ⋅ Jianfeng Feng ⋅ Yanwei Fu
Jun 7, 3:30 PM - 5:30 PM ExHall A
Most research decoding brain signals into images, often using them as priors for generative models, has focused only on visual content. This overlooks the brain's natural ability to integrate auditory and visual information; for instance, sound strongly influences how we perceive visual scenes. To investigate this, we propose a new task of reconstructing continuous video stimuli from multimodal brain signals recorded during audiovisual stimulation. To enable this, we introduce CineBrain, the first large-scale dataset that synchronizes fMRI and EEG during audiovisual viewing, featuring six hours of The Big Bang Theory episodes for cross-modal alignment. We also conduct the first systematic exploration of combining fMRI and EEG for video reconstruction and present CineSync, a framework for reconstructing dynamic video using a Multi-Modal Fusion Encoder and a Neural Latent Decoder. CineSync achieves state-of-the-art performance in dynamic reconstruction, leveraging the complementary strengths of fMRI and EEG to improve visual fidelity. Our analysis shows that auditory cortical activations enhance decoding accuracy, highlighting the role of auditory input in visual perception.
Poster

ComPose: A Unified Completion-Pose Framework for Robust Category-Level Object Pose Estimation

Huan Ren ⋅ Yihan Chen ⋅ Chuxin Wang ⋅ Nailong Liu ⋅ Wenfei Yang ⋅ Tianzhu Zhang
Jun 6, 11:45 AM - 1:45 PM ExHall F
Category-level object pose estimation aims to predict the pose and size of arbitrary objects in specific categories. Existing methods struggle with the inherent incompleteness of observed point clouds, which limits their ability to capture complete object shapes for robust pose reasoning. While point cloud completion offers a promising solution, naively treating it as a separate preprocessing step for partial observations introduces compounding errors and additional computational overhead, ultimately hindering both accuracy and efficiency. To address these challenges, we propose ComPose, a novel unified framework that tightly integrates shape completion to provide complete geometric cues for enhanced pose estimation. At the core of ComPose is a keypoint-based progressive completion module, which recovers full shape representations by progressively predicting a sparse set of keypoints and their surrounding dense point sets, empowering the keypoints to capture holistic object geometries. A geometric relation encoding module further enriches keypoint features with both local and global geometric context. In addition, we introduce a novel geometric relation consistency loss to enforce structural alignment between observed keypoints and their predicted NOCS coordinates, ensuring globally coherent coordinate transformations. Extensive experiments on standard benchmarks demonstrate that our method outperforms state-of-the-art approaches without relying on category-level shape priors. Our method pioneers a new direction for future research by effectively and efficiently integrating shape completion into category-level object pose estimation. Code will be made publicly available.
Poster

CoSMo3D: Open-World Promptable 3D Semantic Segmentation through LLM-Guided Canonical Spatial Modeling

Li Jin ⋅ Weikai Chen ⋅ Yujie Wang ⋅ Yingda Yin ⋅ Zeyu HU ⋅ Runze Zhang ⋅ Keyang Luo ⋅ Shengju Qian ⋅ Xin Wang ⋅ Xueying Qin
Jun 6, 11:45 AM - 1:45 PM ExHall F
Open-world promptable 3D semantic segmentation remains brittle because semantics are inferred in the input sensor coordinates. Humans, in contrast, interpret parts via functional roles in a canonical space: wings extend laterally, handles protrude to the side, and legs support from below. Psychophysical evidence shows that we mentally rotate objects into canonical frames to reveal these roles. To fill this gap, we propose CoSMo3D, which attains canonical space perception by inducing a latent canonical reference frame learned directly from data. By construction, we create a unified canonical dataset through LLM-guided intra- and cross-category alignment, exposing canonical spatial regularities across 200 categories. By induction, we realize canonicality inside the model through a dual-branch architecture with canonical map anchoring and canonical box calibration, collapsing pose variation and symmetry into a stable canonical embedding. This shift from input pose space to canonical representation yields far more stable and transferable part semantics. Experimental results show that CoSMo3D establishes a new state of the art in open-world promptable 3D segmentation.
Poster

Customized Fusion: A Closed-Loop Dynamic Network for Adaptive Multi-Task-Aware Infrared-Visible Image Fusion

Zengyi Yang ⋅ Yu Liu ⋅ Juan Cheng ⋅ Zhiqin Zhu ⋅ Yafei Zhang ⋅ Huafeng Li
Jun 5, 10:45 AM - 12:45 PM ExHall A-F
Infrared-visible image fusion aims to integrate complementary information for robust visual understanding, but existing fusion methods struggle with simultaneously adapting to multiple downstream tasks. To address this issue, we propose a Closed-Loop Dynamic Network (CLDyN) that can adaptively respond to the semantic requirements of diverse downstream tasks for task-customized image fusion. Specifically, CLDyN introduces a closed-loop optimization mechanism that establishes a semantic transmission chain to achieve explicit feedback from downstream tasks to the fusion network through a Requirement-driven Semantic Compensation (RSC) module. The RSC module leverages a Basis Vector Bank (BVB) and an Architecture-Adaptive Semantic Injection (A2SI) block to customize the network architecture according to task requirements, thereby enabling task-specific semantic compensation and allowing the fusion network to actively adapt to diverse tasks without retraining. To promote accurate semantic compensation, a reward-penalty strategy is introduced to reward or penalize the RSC module based on task performance variations. Experiments on the M3FD, FMB, and VT5000 datasets demonstrate that CLDyN not only maintains high fusion quality but also exhibits strong multi-task adaptability.
Poster

Data Leakage Detection and De-duplication in Large Scale Geospatial Image Datasets

Yeshwanth Kumar Adimoolam ⋅ Charalambos Poullis ⋅ Melinos Averkiou
Jun 5, 10:45 AM - 12:45 PM ExHall A-F
In our study, we conducted a comprehensive analysis of three widely used datasets in the domain of building footprint extraction using deep neural networks: the INRIA Aerial Image Labelling dataset, SpaceNet 2: Building Detection v2, and the AICrowd Mapping Challenge dataset. Our experiments revealed several issues in the AICrowd Mapping Challenge dataset, where nearly 90% (about 250k) of the training split images had identical copies, indicating a high level of duplicate data. Additionally, we found that approximately 56k of the 60k images in the validation split were also present in the training split, amounting to 93% data leakage. Furthermore, we present a data validation pipeline to address these issues of duplication and data leakage, which hinder the performance of models trained on such datasets. Employing perceptual hashing techniques, this pipeline is designed for efficient de-duplication and leakage identification. It aims to thoroughly evaluate the quality of datasets before their use, thereby ensuring the reliability and robustness of the trained models.
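To make the de-duplication and leakage check above concrete, here is a minimal sketch of the general idea. The abstract only says "perceptual hashing techniques"; the choice of a simple average hash, the function names (`average_hash`, `find_duplicates`, `leakage_rate`), and exact-hash matching are all assumptions made for illustration and are not taken from the paper.

```python
import numpy as np


def average_hash(image: np.ndarray, hash_size: int = 8) -> int:
    """Average hash of a 2D grayscale array (an assumed, illustrative hash).

    The image must be at least hash_size pixels along each axis.
    """
    h, w = image.shape
    bh, bw = h // hash_size, w // hash_size
    # Crop so the image splits evenly into hash_size x hash_size blocks.
    cropped = image[: bh * hash_size, : bw * hash_size].astype(np.float64)
    # Mean intensity of each block gives a hash_size x hash_size thumbnail.
    blocks = cropped.reshape(hash_size, bh, hash_size, bw).mean(axis=(1, 3))
    # One bit per block: is the block brighter than the thumbnail mean?
    bits = (blocks > blocks.mean()).flatten()
    value = 0
    for bit in bits:
        value = (value << 1) | int(bit)
    return value


def find_duplicates(images):
    """Return groups of indices whose images share the same hash."""
    groups = {}
    for idx, img in enumerate(images):
        groups.setdefault(average_hash(img), []).append(idx)
    return [group for group in groups.values() if len(group) > 1]


def leakage_rate(train_images, val_images):
    """Fraction of validation images whose hash also occurs in training."""
    train_hashes = {average_hash(img) for img in train_images}
    leaked = sum(average_hash(img) in train_hashes for img in val_images)
    return leaked / len(val_images)
```

As a point of reference for the figures quoted in the abstract, 56k leaked images out of 60k validation images corresponds to 56/60 ≈ 0.93. A practical pipeline would likely compare hashes by Hamming distance rather than exact equality so that near-duplicates are also caught.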
Poster

Differentiable Vector Quantization for Rate-Distortion Optimization of Generative Image Compression

SHIYIN JIANG ⋅ Wei Long ⋅ Minghao Han ⋅ Zhenghao Chen ⋅ Ce Zhu ⋅ Shuhang Gu
Jun 6, 11:45 AM - 1:45 PM ExHall F
The proliferation of visual data under tight storage and bandwidth budgets makes extremely low-bitrate generative image compression increasingly important. Vector quantization (VQ) is compelling in this regime because codebooks encode cross-channel correlations and dataset-level semantics, enabling perceptually faithful reconstructions when bits are scarce. We propose RDVQ, a VQ-based generative image compression method designed for extremely low bitrates. End-to-end learned image codecs rely on a differentiable rate term for rate–distortion (RD) optimization; a key challenge, however, is that naïvely integrating VQ introduces non-differentiability and is not directly compatible with entropy modeling, forcing prior work to regulate bitrate only indirectly. We resolve this by defining a distance-aware soft posterior over codebook indices and training a conditional autoregressive entropy model to predict it. The cross-entropy between the approximate and predicted posteriors then yields a differentiable rate loss, restoring a gradient pathway from rate to the encoder via codeword distances. The predicted codebook index distribution also enables prefix-only transmission at inference, with the model imputing the rest of the indices, delivering retraining-free bitrate control over a practical range. Our end-to-end RD-optimized RDVQ outperforms all baseline methods in terms of DISTS and CLIPIQA on the Kodak, DIV2K, and CLIC2020 datasets, reflecting superior structural restoration and better alignment with human visual perception.
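The rate term described above can be illustrated with a small forward-pass sketch. The abstract does not give the exact form of the soft posterior, so the softmax over negative squared distances, the `temperature` parameter, and the function names are assumptions for illustration only. NumPy computes values but not gradients; in an autodiff framework the same expression would be differentiable with respect to the latents.

```python
import numpy as np


def log_softmax(x: np.ndarray, axis: int = -1) -> np.ndarray:
    """Numerically stable log-softmax."""
    shifted = x - x.max(axis=axis, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=axis, keepdims=True))


def soft_posterior(latents, codebook, temperature=1.0):
    """Assumed soft posterior: softmax over negative squared distances.

    latents: (N, D) encoder outputs; codebook: (K, D) codewords.
    Returns an (N, K) array whose rows sum to 1.
    """
    diff = latents[:, None, :] - codebook[None, :, :]
    sq_dist = (diff ** 2).sum(axis=-1)
    return np.exp(log_softmax(-sq_dist / temperature, axis=-1))


def rate_bits(posterior, predicted_logits):
    """Mean cross-entropy, in bits per index, between the soft posterior
    and the distribution predicted by a (hypothetical) entropy model."""
    log_p = log_softmax(predicted_logits, axis=-1)
    return float(-(posterior * log_p).sum(axis=-1).mean() / np.log(2.0))
```

With a uniform prediction over K codewords the rate is log2(K) bits per index regardless of the posterior; a prediction that concentrates on the likely codewords lowers it, which is the signal a rate loss of this kind would push through the codeword distances.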
Poster

Dual Band Thermal Videography: Separating Time-Varying Reflection and Emission Near Ambient Conditions

Sriram Narayanan ⋅ Mani Ramanagopal ⋅ Srinivasa G. Narasimhan
Jun 5, 10:45 AM - 12:45 PM ExHall A-F
Long-wave infrared radiation captured by a thermal camera includes (a) emission from an object governed by its temperature and emissivity, and (b) reflected radiation from the surrounding environment. Separating these components is a long-standing challenge in thermography. Even when using multiple bands, the problem is under-determined without priors on emissivity. This difficulty is amplified in near-ambient conditions, where emitted and reflected signals are of comparable magnitude. We present a dual-band video thermography framework that reduces this ambiguity by combining two complementary ideas at a per-pixel level: (i) spectral cues (the ratio of emissivity between bands is unknown but fixed), and (ii) temporal cues (object radiation changes smoothly while background radiation changes rapidly). We derive an image formation model and an algorithm to jointly estimate the object's emissivity at each band, and the time-varying object and background temperatures. Experiments with calibrated and uncalibrated emissivities in everyday scenes (e.g., coffee pot heating up, palm print on mirrors dissipating, reflections of moving people) demonstrate robust separation and recovery of temperature fields. We will release code and data upon acceptance.
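The emission-plus-reflection decomposition in (a) and (b) can be sketched with the textbook model for an opaque surface, where reflectivity is one minus emissivity. This is not the image formation model derived in the paper, which the abstract does not spell out; the single-wavelength evaluation (a real camera integrates over a band) and the function names are assumptions for illustration.

```python
import math

H = 6.62607015e-34   # Planck constant, J s
C = 299792458.0      # speed of light, m/s
K_B = 1.380649e-23   # Boltzmann constant, J/K


def planck_radiance(wavelength_m: float, temperature_k: float) -> float:
    """Blackbody spectral radiance (Planck's law) at one wavelength."""
    prefactor = 2.0 * H * C ** 2 / wavelength_m ** 5
    exponent = H * C / (wavelength_m * K_B * temperature_k)
    return prefactor / math.expm1(exponent)


def observed_components(wavelength_m, emissivity, t_object_k, t_background_k):
    """Return (emitted, reflected) radiance for an opaque surface.

    Assumes reflectivity = 1 - emissivity, the standard opaque-surface
    simplification, not necessarily the model used in the paper.
    """
    emitted = emissivity * planck_radiance(wavelength_m, t_object_k)
    reflected = (1.0 - emissivity) * planck_radiance(wavelength_m, t_background_k)
    return emitted, reflected
```

Under this simplified model, when the object and background temperatures are equal the summed radiance matches a blackbody at that temperature for any emissivity, so a single measurement carries no information about emissivity. This is one way to see why the near-ambient regime is hard.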
Poster

Dual-level Adapter Boosting Prompt-free Curvilinear Structure Segmentation

Kai Zhu ⋅ Li Chen ⋅ Jun Cheng
Jun 7, 3:30 PM - 5:30 PM ExHall A
Curvilinear structure segmentation is essential in domains such as medical imaging, remote sensing, and materials science. Existing methods often require extensive domain-specific training and lack generalization to novel domains. To overcome these limitations, we propose the Segment Anything Curve Model (SACM), a universal curvilinear segmentation framework built upon the pretrained Segment Anything Model (SAM). SACM introduces a dual-level adapter architecture that enables both fine-grained and domain-adaptive enhancement: block-level internal adapters refine local structural representations, while external adapters facilitate cross-domain feature alignment. Specifically, the internal adapters are embedded within each Transformer block to locally adapt and refine features for thin and intricate curvilinear patterns, while the external adapters operate across blocks to capture global, multi-layer contextual information and facilitate domain adaptation. Furthermore, SACM introduces a feature fusion mechanism that aggregates multi-layer features from all external adapters and fuses them via a Feed-Forward Network (FFN) module, and a dual-stage refinement process in the mask decoder to enhance topology and connectivity. This design enables prompt-free, data-efficient fine-tuning and achieves robust cross-domain generalization when trained with only 18 annotated images. Extensive experiments across twelve diverse curvilinear datasets validate that SACM achieves state-of-the-art performance.
Poster

Efficiently Reconstructing Dynamic Scenes One D4RT at a Time

Chuhan Zhang ⋅ Guillaume Le Moing ⋅ Skanda Koppula ⋅ Ignacio Rocco ⋅ Liliane Momeni ⋅ Junyu Xie ⋅ Shuyang Sun ⋅ Rahul Sukthankar ⋅ Joëlle K. Barral ⋅ Raia Hadsell ⋅ Zoubin Ghahramani ⋅ Andrew Zisserman ⋅ Junlin Zhang ⋅ Mehdi S. M. Sajjadi
Jun 5, 4:00 PM - 6:00 PM ExHall A & F
Understanding and reconstructing the complex geometry and motion of dynamic 4D scenes from video remains a formidable challenge in computer vision. This paper introduces D4RT, a simple yet powerful feedforward network designed to efficiently solve this task. D4RT utilizes a unified transformer architecture to jointly infer depth, spatio-temporal correspondence, and full camera parameters from a single video. Its core innovation is a novel mechanism that sidesteps the heavy computation of dense, per-frame decoding and the complexity of managing multiple, task-specific decoders. Our unified decoding interface allows the model to independently and efficiently probe the 3D position of any point in space and time. The result is a lightweight and highly scalable method that enables remarkably efficient training and inference. We demonstrate that our approach sets a new state of the art, outperforming previous methods across a wide spectrum of 4D reconstruction tasks.
Poster

Efficient Unrolled Networks for Large-Scale 3D Inverse Problems

Romain Vo ⋅ Julián Tachella
Jun 7, 3:30 PM - 5:30 PM ExHall A
Deep learning-based methods have revolutionized the field of imaging inverse problems, yielding state-of-the-art performance across various imaging domains. The best performing networks incorporate the imaging operator within the network architecture, typically in the form of deep unrolling. However, in large-scale problems, such as 3D imaging, most existing methods fail to incorporate the operator in the architecture due to the prohibitive amount of memory required by global forward operators, which hinder typical patching strategies. In this work, we present a domain partitioning strategy and normal operator approximations that enable the training of end-to-end reconstruction models incorporating forward operators of arbitrarily large problems into their architecture. The proposed method achieves state-of-the-art performance on 3D X-ray cone-beam tomography and 3D multi-coil accelerated MRI, while requiring only a single GPU for both training and inference.
Poster

Evidential Neural Radiance Fields

Ruxiao Duan ⋅ Alex Wong
Jun 7, 11:45 AM - 1:45 PM ExHall F
Understanding sources of uncertainty is fundamental to trustworthy three-dimensional scene modeling. While recent advances in neural radiance fields (NeRFs) achieve impressive accuracy in scene reconstruction and novel view synthesis, the lack of uncertainty estimation significantly limits their deployment in safety-critical settings. Existing uncertainty quantification methods for NeRFs fail to capture both aleatoric and epistemic uncertainty. Among those that do quantify one or the other, many of them either compromise rendering quality or incur significant computational overhead to obtain uncertainty estimates. To address these issues, we introduce Evidential Neural Radiance Fields, a probabilistic approach that seamlessly integrates with the NeRF rendering process and enables direct quantification of both aleatoric and epistemic uncertainty from a single forward pass. We compare multiple uncertainty quantification methods on three standardized benchmarks, where our approach demonstrates state-of-the-art scene reconstruction fidelity and uncertainty estimation quality.
Poster

FedAdamom: Adaptive Momentum for Improved Generalization in Federated Optimization

Wenjie Hou ⋅ Tianxiang Chen ⋅ Feng Wang ⋅ Tiantong Wu ⋅ Zhiming Zheng ⋅ Shaoting Tang ⋅ Wei Yang Bryan Lim
Jun 7, 3:30 PM - 5:30 PM ExHall A
Federated learning (FL) has emerged as a widely adopted training paradigm for privacy-preserving machine learning. Despite the past success of SGD-based methods, they still suffer from severe data heterogeneity and a lack of adaptivity in practical applications. While several adaptive federated optimization methods (such as FedAdam) have been proposed and demonstrated to achieve faster convergence, they fail to show significant improvements in generalization performance under highly heterogeneous data distributions, and their optimization and generalization mechanisms remain insufficiently understood. To fill this gap, we introduce diffusion theory into the adaptive federated optimization framework and analyze the distinct effects of adaptive learning rate and global momentum from the perspectives of saddle-point escaping and flat-minima selection. Theoretical results show that although FedAdam outperforms FedAvg/FedAvgM in escaping saddle points, the latter escapes sharp minima more efficiently. The root cause is that adaptive learning rates, while enhancing saddle-point escape, weaken the preference for flat minima. Motivated by these insights, we propose FedAdamom, a new adaptive federated optimization algorithm that adapts the momentum hyperparameter rather than the learning rate. FedAdamom maintains strong saddle-point escaping capability while enhancing flat-minima selection. We further establish its convergence guarantees under non-convex objectives. Extensive experiments demonstrate that FedAdamom significantly outperforms existing adaptive federated optimization methods in terms of convergence speed, generalization performance, and preference for flat minima.
Poster

FINER: MLLMs Hallucinate under Fine-grained Negative Queries

Rui Xiao ⋅ Sanghwan Kim ⋅ Yongqin Xian ⋅ Zeynep Akata ⋅ Stephan Alaniz
Jun 6, 11:45 AM - 1:45 PM ExHall F
Multimodal large language models (MLLMs) struggle with hallucinations, particularly with fine-grained queries, a challenge underrepresented by existing benchmarks that focus on coarse image-related questions. We introduce FIne-grained NEgative queRies (FINER), alongside two benchmarks: FINER-CompreCap and FINER-DOCCI. Using FINER, we analyze hallucinations across four settings: multi-object, multi-attribute, multi-relation, and "what" questions. Our benchmarks reveal that MLLMs hallucinate when fine-grained mismatches co-occur with genuinely present elements in the image. To address this, we propose FINER-Tuning, leveraging Direct Preference Optimization (DPO) on FINER-inspired data. Finetuning four frontier MLLMs with FINER-Tuning yields up to 24.2% gains (InternVL3.5-14B) on hallucinations from our benchmarks, while simultaneously improving performance on eight existing hallucination suites, and enhancing general multimodal capabilities across six benchmarks. Benchmarks, training data, code and model checkpoints will be released.
Poster

FUSER: Feed-Forward Multiview 3D Registration Transformer and SE(3)^N Diffusion Refinement

Haobo Jiang ⋅ Jin Xie ⋅ Jian Yang ⋅ Liang Yu ⋅ Jianmin Zheng
Jun 5, 4:00 PM - 6:00 PM ExHall A & F
Registration of multiview point clouds typically depends on extensive pairwise matching to build a pose graph for global synchronization, which is computationally expensive and ill-posed without holistic geometric constraints. In this paper, we propose FUSER, the first feed-forward multi-view registration transformer that processes all scans jointly in a unified, compact latent space to directly predict global poses without any pairwise estimation. To maintain tractability, FUSER employs a sparse 3D CNN to encode each scan into low-resolution superpoint features preserving absolute translation cues, followed by a Geometric Alternating Attention module for efficient intra- and inter-scan reasoning. Particularly, we transfer 2D attention priors from off-the-shelf foundation models (i.e., $\pi^3$) to enhance 3D feature attention. Building upon FUSER and its estimates, we further introduce FUSER-DF, an SE(3) diffusion refinement framework to correct FUSER's estimates through a denoising process over the joint SE(3)$^N$ space. Here, FUSER serves as a surrogate multiview register to model the denoiser, and a prior-conditioned SE(3)$^N$ variational lower bound is derived for denoising supervision. Extensive experiments on 3DMatch and ScanNet confirm the superior registration accuracy and efficiency of our method.
Poster

GaussianFluent: Gaussian Simulation for Dynamic Scenes with Mixed Materials

Bei Huang ⋅ Yixin Chen ⋅ Ruijie Lu ⋅ Gang Zeng ⋅ Hongbin Zha ⋅ Yuru Pei ⋅ Siyuan Huang
Jun 6, 4:45 PM - 6:45 PM ExHall A & F
3D Gaussian Splatting (3DGS) has emerged as a prominent 3D representation for high-fidelity and real-time rendering. Prior work has coupled physics simulation with Gaussians, but predominantly targets soft, deformable materials, leaving brittle fracture largely unresolved. This stems from two key obstacles: the lack of volumetric interiors with coherent textures in the GS representation, and the absence of fracture-aware simulation methods for Gaussians. To address these challenges, we introduce GaussianFluent, a unified framework for realistic simulation and rendering of dynamic object states. First, it synthesizes photorealistic interiors by densifying internal Gaussians guided by generative models. Second, it integrates an optimized Continuum Damage Material Point Method (CD-MPM) to enable brittle fracture simulation at remarkably high speed. Our approach handles complex scenarios including mixed-material objects and multi-stage fracture propagation, achieving results infeasible with previous methods. Experiments clearly demonstrate GaussianFluent's capability for photorealistic, real-time rendering with structurally consistent interiors, highlighting its potential for downstream applications such as VR and robotics.
Poster

GeoViS: Geospatially Rewarded Visual Search for Remote Sensing Visual Grounding

Peirong Zhang ⋅ Yidan Zhang ⋅ Luxiao Xu ⋅ Jinliang Lin ⋅ Zonghao Guo ⋅ Fengxiang Wang ⋅ Xue Yang ⋅ Kaiwen Wei ⋅ Lei Wang
Jun 6, 11:45 AM - 1:45 PM ExHall F
Recent advances in multimodal large language models (MLLMs) have led to remarkable progress in visual grounding, enabling fine-grained cross-modal alignment between textual queries and image regions. However, transferring such capabilities to remote sensing imagery remains challenging, as targets are often extremely small within kilometer-scale scenes, and queries typically involve intricate geospatial relations such as relative positions, spatial hierarchies, or contextual dependencies across distant objects. To address these challenges, we propose GeoViS, a Geospatially Rewarded Visual Search framework that reformulates remote sensing visual grounding as a progressive search-and-reasoning process. Rather than directly predicting the target location in a single step, GeoViS actively explores the global image through a tree-structured sequence of visual cues, integrating multimodal perception, spatial reasoning, and reward-guided exploration to refine geospatial hypotheses iteratively. This design enables the model to detect subtle small-scale targets while maintaining holistic scene awareness. Extensive experiments on five remote sensing grounding benchmarks demonstrate that GeoViS achieves precise geospatial understanding and consistently surpasses existing methods across key visual grounding metrics, highlighting its strong cross-domain generalization and interpretability.
Poster

GLINT: Modeling Scene-Scale Transparency via Gaussian Radiance Transport

Youngju Na ⋅ Jaeseong Yun ⋅ Soohyun Ryu ⋅ Hyunsu Kim ⋅ Sung-Eui Yoon ⋅ Suyong Yeon
Jun 5, 4:00 PM - 6:00 PM ExHall A & F
While 3D Gaussian splatting has emerged as a powerful paradigm, it fundamentally fails to model transparency such as glass panels, which are prevalent in everyday environments. The core challenge lies in decoupling the intertwined radiance contributions from transparent interfaces and the transmitted geometry observed through the glass. We present GLINT, a framework that models scene-scale transparency through explicit decomposed Gaussian representation. GLINT reconstructs the primary interface and separates outgoing radiance into reflection and transmission components according to its optical properties, enabling coherent Gaussian radiance transport. During the optimization, GLINT bootstraps transparency localization by utilizing geometry separation cues that emerge from our decomposition with the geometry and material priors from a pre-trained video relighting model. Extensive experiments demonstrate that GLINT achieves state-of-the-art performance in 3D reconstruction of complex transparent scenes. Our code will be released publicly.
Poster

Global-Aware Edge Prioritization for Pose Graph Initialization

Tong Wei ⋅ Giorgos Tolias ⋅ Jiri Matas ⋅ Daniel Barath
Jun 7, 11:45 AM - 1:45 PM ExHall F
The pose graph is a core component of Structure-from-Motion (SfM), where images act as nodes and edges encode relative poses. Since geometric verification is expensive, SfM pipelines restrict the pose graph to a sparse set of candidate edges, making initialization critical. Existing methods rely on image retrieval to connect each image to its $k$ nearest neighbors, treating pairs independently and ignoring global consistency. We address this limitation through the concept of edge prioritization, ranking candidate edges by their utility for SfM. Our approach has three components: (1) a GNN trained with SfM-derived supervision to predict globally consistent edge reliability; (2) multi-minimal-spanning-tree-based pose graph construction guided by these ranks; and (3) connectivity-aware score modulation that reinforces weak regions and reduces graph diameter. This globally informed initialization yields more reliable and compact pose graphs, improving reconstruction accuracy in sparse and high-speed settings and outperforming SOTA retrieval methods on ambiguous scenes. Code and models will be released.
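As a rough illustration of component (2), the sketch below builds a pose graph from ranked candidate edges by extracting several edge-disjoint spanning trees in turn. It is an assumption-laden toy, not the authors' algorithm: the edge scores stand in for the GNN's predicted reliability, each tree is a maximum-score spanning tree found with Kruskal's method, and the names `spanning_tree` and `multi_tree_pose_graph` are hypothetical.

```python
def spanning_tree(n_nodes, scored_edges):
    """Kruskal's algorithm, taking the highest-scoring edges first.

    scored_edges is a list of (score, i, j) tuples; returns the chosen edges.
    """
    parent = list(range(n_nodes))

    def find(a):
        # Union-find root lookup with path halving.
        while parent[a] != a:
            parent[a] = parent[parent[a]]
            a = parent[a]
        return a

    tree = []
    for score, i, j in sorted(scored_edges, reverse=True):
        ri, rj = find(i), find(j)
        if ri != rj:  # edge joins two components, so it cannot close a cycle
            parent[ri] = rj
            tree.append((score, i, j))
    return tree


def multi_tree_pose_graph(n_nodes, scored_edges, n_trees=2):
    """Union of n_trees edge-disjoint spanning trees (assumed construction)."""
    remaining = list(scored_edges)
    selected = []
    for _ in range(n_trees):
        tree = spanning_tree(n_nodes, remaining)
        if not tree:
            break
        selected.extend(tree)
        chosen = set(tree)
        remaining = [e for e in remaining if e not in chosen]
    return selected
```

Compared with connecting every image to its $k$ nearest neighbors, each extracted tree touches every reachable image, so the selected edges cannot leave an image isolated as long as the candidate set is connected.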
Poster

Hearing the Room Through the Shape of the Drum: Modal-Guided Sound Recovery from Multi-Point Surface Vibrations

Shai Bagon ⋅ Matan Kichler ⋅ Mark Sheinin
ExHall A
Optical vibration sensing enables recovering the scene sound directly from the surface vibration of nearby objects, turning everyday objects into "visual microphones". However, most prior methods have focused on capturing the vibrations of specific objects with highly favorable vibration responses. These include objects where the surface vibrations are generated by the object itself (e.g., speaker membrane or guitar body) or objects consisting of a thin membrane that is highly reactive to sound (e.g., a chip bag or the leaf of a plant). In this paper, we tackle sound recovery for a more challenging class of solid objects whose vibration responses are poor or highly resonant. We simultaneously capture vibrations for multiple surface points on the object using a speckle-based vibrometry imaging system. Then, we derive a novel physics-guided vibration formation model that relates the scene sound source to the captured multi-point multi-axis vibrations via the object's vibrational modes. The model is then used to reverse the resonant transfer function of the vibrating object, fusing the plurality of vibration signals to estimate the original sound source of the scene. We evaluate our approach by recovering sound from a variety of everyday objects, demonstrating that it significantly outperforms traditional single-point speckle vibrometry in challenging scenarios where the latter performs poorly.
Poster

ImmerIris: A Large-Scale Dataset and Benchmark for Off-Axis and Unconstrained Iris Recognition in Immersive Applications

Yuxi Mi ⋅ Qiuyang Yuan ⋅ Zhizhou Zhong ⋅ Xuan Zhao ⋅ Jiaogen Zhou ⋅ Fubao Zhu ⋅ Jihong Guan ⋅ Shuigeng Zhou
Jun 7, 11:45 AM - 1:45 PM ExHall F
Iris recognition has recently been regaining prominence in immersive applications such as extended reality as a means of seamless user identification. This application scenario introduces unique challenges compared to traditional iris recognition under controlled setups, as the ocular images are primarily captured off-axis and less constrained, causing perspective distortion, intra-subject variation, and quality degradation in iris textures. Datasets capturing these challenges remain limited. This paper fills this gap by presenting a large-scale iris dataset collected via head-mounted displays, termed ImmerIris. It contains 499,791 ocular images from 564 subjects, and is, to our knowledge, the largest public iris dataset to date and among the first dedicated to immersive applications. It is accompanied by a comprehensive set of evaluation protocols that benchmark recognition systems under various challenging conditions. This paper also draws attention to a shared obstacle of current recognition methods: the reliance on a pre-processing normalization stage, which is fallible in off-axis and unconstrained setups. To this end, this paper further proposes a normalization-free paradigm that directly learns from minimally adjusted ocular images. Despite its simplicity, it outperforms normalization-based prior art, indicating a promising direction for robust iris recognition.
Poster

LDP-Slicing: Local Differential Privacy for Images via Randomized Bit-Plane Slicing

Yuanming Cao ⋅ Chengqi Li ⋅ Wenbo He
Jun 5, 10:45 AM - 12:45 PM ExHall A-F
Local Differential Privacy (LDP) is the gold-standard trust model for privacy-preserving machine learning because it guarantees privacy at the data source. However, its application to image data has long been considered impractical due to the high dimensionality of pixel space. Canonical LDP mechanisms are designed for low-dimensional data, resulting in severe utility degradation when applied to high-dimensional pixel spaces. This paper demonstrates that this utility loss is not inherent to LDP, but stems from its application to an inappropriate data representation. We introduce LDP-Slicing, a lightweight, training-free framework that resolves this domain mismatch. Our key insight is to decompose pixel values into a sequence of binary bit-planes. This transformation allows us to apply the LDP mechanism directly to the bit-level representation. To further strengthen privacy and preserve utility, we integrate a perceptual obfuscation module that mitigates human-perceivable leakage and an optimization-based privacy budget allocation strategy. This pipeline satisfies rigorous pixel-level $\varepsilon$-LDP while producing images that retain high utility for downstream tasks. Extensive experiments on face recognition and image classification demonstrate that LDP-Slicing outperforms existing DP/LDP baselines under comparable privacy budgets, with negligible computational overhead.
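The bit-plane idea can be sketched in a few lines. The example below splits one 8-bit pixel into bits and perturbs each bit with binary randomized response, a canonical $\varepsilon$-LDP mechanism for a single bit. Which mechanism LDP-Slicing actually applies, and how it allocates the budget across planes, is not stated in the abstract, so the per-bit `budgets` list and all function names here are assumptions for illustration only.

```python
import math
import random

def to_bit_planes(pixel, n_bits=8):
    """Split an integer pixel value into bits, most significant first."""
    return [(pixel >> k) & 1 for k in range(n_bits - 1, -1, -1)]

def from_bit_planes(bits):
    """Reassemble an integer from bits given most significant first."""
    value = 0
    for b in bits:
        value = (value << 1) | b
    return value

def randomized_response(bit, epsilon, rng):
    """Keep the bit with probability e^eps / (1 + e^eps), else flip it."""
    keep_prob = math.exp(epsilon) / (1.0 + math.exp(epsilon))
    return bit if rng.random() < keep_prob else 1 - bit

def privatize_pixel(pixel, budgets, rng):
    """Perturb each bit-plane of one pixel with its own budget (assumed split).

    By sequential composition, the pixel-level budget is sum(budgets).
    """
    bits = to_bit_planes(pixel, len(budgets))
    noisy = [randomized_response(b, eps, rng) for b, eps in zip(bits, budgets)]
    return from_bit_planes(noisy)
```

Giving the most significant bits a larger share of the budget keeps flips there rare, which is one plausible reason a bit-level representation preserves more utility than perturbing the 256-valued pixel directly.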
Poster

Learning Diffeomorphism for Medical Image Registration with Time-Embedded Architectures Using Semigroup Regularization

Mohammadjavad Matinkia ⋅ Nilanjan Ray
Jun 7, 11:45 AM - 1:45 PM ExHall F
Diffeomorphic image registration (DIR) seeks topology-preserving transformations and is fundamental in medical imaging. Existing DIR methods rely on integration schemes (e.g., scaling-and-squaring) and multiple regularizers to enforce invertibility. We introduce **SGDIR**, a continuous-time registration framework, parameterized by known time-embedded backbones, that models diffeomorphisms using only a single semigroup-based regularization, eliminating explicit integration and auxiliary constraints. We mathematically prove that this formulation directly learns the flow of an underlying ODE, inherently enforcing inverse and cycle consistencies. We evaluate on eight 2D and 3D MR and CT datasets. Under strict semigroup enforcement, our model achieves near-perfect diffeomorphism (near-zero folding) and significantly outperforms existing diffeomorphic methods, while remaining competitive with leading non-diffeomorphic deformable models. When the regularization is relaxed, the same architecture functions as a deformable method and substantially surpasses state-of-the-art non-diffeomorphic approaches in registration accuracy. These results demonstrate that continuous-time deformation modeling, guided solely by our semigroup-based regularization, yields a unified framework capable of both rigorously diffeomorphic mapping and state-of-the-art deformable registration.
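The semigroup property behind this regularization is that the flow of an autonomous ODE satisfies $\phi_{s+t} = \phi_s \circ \phi_t$. The toy below checks that identity on a scalar linear ODE, $\dot{x} = a x$, and shows that a map which is not a true flow violates it. The squared-residual penalty and every name in the block are assumptions for illustration; the exact loss used by SGDIR is not given in the abstract.

```python
import math

def exact_flow(x, t, a=-0.5):
    """True flow of dx/dt = a * x: phi_t(x) = x * exp(a * t)."""
    return x * math.exp(a * t)

def euler_step(x, t, a=-0.5):
    """Single explicit Euler step of length t; not a true flow map."""
    return x * (1.0 + a * t)

def semigroup_residual(flow, x, s, t):
    """Squared gap between phi_{s+t}(x) and phi_s(phi_t(x))."""
    composed = flow(flow(x, t), s)
    direct = flow(x, s + t)
    return (direct - composed) ** 2
```

Together with $\phi_0 = \mathrm{id}$, the identity gives $\phi_{-t} \circ \phi_t = \mathrm{id}$, which is why driving this residual to zero also yields inverse consistency.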
Poster

Learning Eigenstructures of Unstructured Data Manifolds

Roy Velich ⋅ Arkadi Piven ⋅ David Bensaid ⋅ Daniel Cremers ⋅ Thomas Dagès ⋅ Ron Kimmel
Jun 7, 3:30 PM - 5:30 PM ExHall A
We introduce a novel framework that directly learns a spectral basis for shape and manifold analysis from unstructured data, eliminating the need for traditional operator selection, discretization, and eigensolvers. Grounded in optimal-approximation theory, we train a network to decompose an implicit approximation operator by minimizing the reconstruction error in the learned basis over a chosen distribution of probe functions. For suitable distributions, the learned operator and basis can be seen as approximations of the Laplacian operator and its eigendecomposition, which are fundamental in geometry processing. Furthermore, our method recovers in a unified manner not only the spectral basis, but also the implicit metric's sampling density and the eigenvalues of the underlying operator. Notably, our unsupervised method makes no assumption on the data manifold, such as meshing or manifold dimensionality, allowing it to scale to arbitrary datasets of any dimension. On point clouds lying on surfaces in 3D and high-dimensional image manifolds, our approach yields meaningful spectral bases that can resemble those of the Laplacian, without explicit construction of an operator. By replacing the traditional operator selection, construction, and eigendecomposition with a learning-based approach, our framework offers a principled, data-driven alternative to conventional pipelines. This opens new possibilities in geometry processing for unstructured data, particularly in high-dimensional spaces.
Poster

Learning Latent Concepts for Detecting Out-of-Distribution Objects

Ting Peng ⋅ Junhao Dong ⋅ Yew-Soon Ong
Jun 7, 11:45 AM - 1:45 PM ExHall F
Detecting out-of-distribution (OOD) objects is indispensable for safely deploying object detectors in the wild. Current approaches enable the unknown-aware ability by regularizing the instance-level feature space, such as outlier synthesis. Despite the general efficacy, it is challenging to truly learn the concept of 'unknown' in the absence of real unknown data. In this paper, we propose UNO-Adapter, a simple yet highly effective framework tailored for OOD object detection. Our key insight is that in object detection, where in-distribution (ID) and OOD objects may coexist within the same context, we need global abstraction and reasoning to help the detector learn their differences, i.e., unknown injection. UNO-Adapter consists of two key steps: unsupervised concept discovery and neural concept binder. The former introduces an object-centric learning paradigm to abstract and model the holistic image, including both ID and OOD, obtaining sparse and compressed slot-based representations with relational constraints. The latter dynamically combines slots with object candidates extracted by the detector, binding the concept of unknown to the de facto detector. During inference, we introduce an image-guided OOD object score to reinforce the distinction between ID and OOD. Experiments on standard benchmarks demonstrate the superiority of the proposed method. In particular, UNO-Adapter reduces the FPR95 by up to 11.96% compared to the previous best OOD object detection method.
Poster

Linear Fundamental Matrix Estimation from 7 or 5 Points

Taci Ata Kucukpinar ⋅ Juan Mogollon ⋅ Joshua Fraser ⋅ Timothy Duff ⋅ Kannappan Palaniappan
Jun 6, 4:45 PM - 6:45 PM ExHall A & F
We revisit the problem of estimating the fundamental matrix of a pair of perspective cameras, a cornerstone of geometric computer vision. As is well-known, linear solvers require at least 8 point correspondences, whereas nonlinear minimal solvers require just 7 in the uncalibrated case or 5 in the calibrated case. In this paper, we consider a special case of the 7-point problem where 5 of the points are configured to lie on two lines, which has previously been shown to have a unique solution. As a theoretical contribution, we offer an analysis of how this uniqueness manifests in the standard 7-point algorithm. On a practical level, we provide the first practical linear solver for the minimal problem associated with this special configuration. Additionally, we evaluate a heuristic 5-point fundamental matrix solver based on the construction of virtual midpoints. When combined with early non-minimal fitting, the runtime and accuracy of our solver are competitive with the state-of-the-art (SoTA) on multiple benchmarks.
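The point counts quoted above follow from the epipolar constraint $x'^{\top} F x = 0$: each correspondence gives one linear equation in the 9 entries of $F$, which is defined only up to scale, so a purely linear solve needs 8 equations, and adding $\det F = 0$ brings the minimal count to 7. The sketch below builds that equation row and checks it on a synthetic pair of cameras. The setup (identity intrinsics, pure translation `t`, so that $F = [t]_\times$) and all function names are assumptions for illustration; this is the classical constraint, not the solver proposed in the paper.

```python
def epipolar_row(x, xp):
    """Coefficients of x'^T F x = 0 in the row-major entries of F."""
    u, v = x
    up, vp = xp
    return [up * u, up * v, up, vp * u, vp * v, vp, u, v, 1.0]

def skew(t):
    """Cross-product matrix [t]_x, so that skew(t) @ y equals t x y."""
    tx, ty, tz = t
    return [[0.0, -tz, ty], [tz, 0.0, -tx], [-ty, tx, 0.0]]

def project(point, t=(0.0, 0.0, 0.0)):
    """Pinhole projection with identity intrinsics after translating by t."""
    X, Y, Z = (p + s for p, s in zip(point, t))
    return (X / Z, Y / Z)

def residual(point, t):
    """Evaluate the epipolar constraint for one 3D point seen by both cameras."""
    f = [entry for row in skew(t) for entry in row]
    row = epipolar_row(project(point), project(point, t))
    return sum(a * b for a, b in zip(row, f))
```

Stacking one such row per correspondence gives the matrix whose null space the 8-point algorithm extracts; with only 7 rows the null space is two-dimensional, and the cubic $\det F = 0$ selects among its members.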
Poster

MAMMA: Markerless Accurate Multi-person Motion Acquisition

Hanz Cuevas Velasquez ⋅ Anastasios Yiannakidis ⋅ Soyong Shin ⋅ Giorgio Becherini ⋅ Markus Höschle ⋅ Joachim Tesch ⋅ Taylor Obersat ⋅ Tsvetelina Alexiadis ⋅ Eni Halilaj ⋅ Michael J. Black
Jun 5, 4:00 PM - 6:00 PM ExHall A & F
We present MAMMA, a markerless motion-capture pipeline that accurately recovers SMPL-X parameters from multi-view video. Traditional motion-capture systems rely on physical markers. Although they offer high accuracy, their requirements of specialised hardware, manual marker placement, and extensive post-processing make them costly and time-consuming. Recent learning-based methods attempt to overcome these limitations, but most are designed for single-person capture, rely on sparse keypoints, or struggle with occlusions and physical interactions. In this work, we introduce a method that predicts dense 2D surface landmarks conditioned on segmentation masks, enabling person-specific correspondence estimation even under heavy occlusion. We employ a novel architecture that exploits learnable queries for each landmark. We demonstrate that our approach can handle complex person-person interaction and offers greater accuracy than existing methods. To train our network, we construct a large, synthetic multi-view dataset combining human motions from diverse sources, including extreme poses, hand motions, and close interactions. Our dataset yields high-variability synthetic sequences with rich body contact and occlusion, and includes SMPL-X ground-truth annotations with dense 2D landmarks. The result is a system capable of accurately capturing human motion without the need for markers. Our approach offers competitive reconstruction quality compared to commercial marker-based motion-capture solutions, without the extensive manual cleanup. Finally, we address the absence of common benchmarks for dense-landmark prediction and markerless motion capture by introducing two evaluation settings built from real multi-view sequences. We will release our dataset, method, code, and model weights for research purposes.
Poster

Mapping Networks

Lord Sen ⋅ Shyamapada Mukherjee
Jun 7, 3:30 PM - 5:30 PM ExHall A
The escalating parameter counts in modern deep learning models pose a fundamental challenge to efficient training and resolution of overfitting. We address this by introducing Mapping Networks, which replace the high-dimensional weight space by a compact, trainable latent vector, based on the hypothesis that the trained parameters of large networks reside on smooth, low-dimensional manifolds. Accordingly, the Mapping Theorem, enforced by a dedicated Mapping Loss, shows the existence of a mapping from this latent space to the target weight space both theoretically and in practice. Mapping Networks significantly reduce overfitting and achieve comparable or better performance than the target network across complex vision and sequence tasks, including Image Classification, Deepfake Detection, etc., with a 99.5% (i.e., around 500×) reduction in trainable parameters.
Poster

Medic-AD: Towards Medical Vision-Language Model's Clinical Intelligence

Woohyeon Park ⋅ Jaeik Kim ⋅ Sunghwan Steve Cho ⋅ Pa Hong ⋅ Wookyoung Jeong ⋅ Yoojin Nam ⋅ Namjoon Kim ⋅ Ginny Y. Wong ⋅ Ka Chun Cheung ⋅ Jaeyoung Do
Jun 7, 3:30 PM - 5:30 PM ExHall A
Lesion detection, symptom tracking, and visual explainability are central to real-world medical image analysis, yet current medical Vision-Language Models (VLMs) still lack mechanisms that translate their broad knowledge into clinically actionable outputs. To bridge this gap, we present Medic-AD, a clinically oriented VLM that strengthens these three capabilities through a stage-wise framework. First, learnable anomaly-aware tokens (Ano) encourage the model to focus on abnormal regions and build more discriminative lesion-centered representations. Second, inter-image difference tokens (Diff) explicitly encode temporal changes between studies, allowing the model to distinguish worsening, improvement, and stability in disease burden. Finally, a dedicated explainability stage trains the model to generate heatmaps that highlight lesion-related regions, offering clear visual evidence that is consistent with the model's reasoning. Through our staged design, Medic-AD steadily boosts performance across anomaly detection, symptom tracking, and anomaly segmentation, achieving state-of-the-art results compared with both closed-source and medical-specialized baselines. Evaluations on longitudinal clinical data collected from real hospital workflows further show that Medic-AD delivers stable predictions and clinically faithful explanations in practical patient-monitoring and decision-support workflows.
Poster

Memory-Augmented Scene Understanding and Exploration for Open-World Aerial Object-Goal Navigation

Jiacong Zhou ⋅ Jiaxu Miao ⋅ Yourun Lin ⋅ Xianyun Wang ⋅ Jun Xiao ⋅ Jun Yu
Jun 6, 4:45 PM - 6:45 PM ExHall A & F
Aerial object-goal navigation (Aerial ObjectNav) requires an Unmanned Aerial Vehicle (UAV) to navigate to target objects in large-scale outdoor environments using only visual observations and high-level object descriptions, without detailed step-by-step instructions. Existing approaches rely on local observations or short-term history, lacking comprehensive scene understanding and efficient spatial exploration strategies, which constrains their navigation capability in complex aerial scenarios. To address these challenges, we propose OctMem-Agent, an octree memory-augmented framework for aerial object-goal navigation. Specifically, we introduce an Adaptive Octree Memory that incrementally aggregates RGB-D observations into a hierarchical 3D representation, capturing both explored regions and unexplored frontiers across large-scale aerial environments. We further propose an Instruction-Guided Memory Query module that extracts task-relevant scene and exploration tokens through instruction-modulated queries. By integrating these tokens with visual observations and language instructions, OctMem-Agent achieves comprehensive scene understanding and effective spatial exploration for target localization. Extensive experiments on the Aerial ObjectNav benchmark UAV-ON demonstrate that our method achieves a significant 7.5% improvement in success rate over existing methods, validating the effectiveness of our design.
Poster

MetaSpectra+: A Compact Broadband Metasurface Camera for Snapshot Hyperspectral+ Imaging

Yuxuan Liu ⋅ Wei Xu ⋅ Qi Guo
Jun 5, 10:45 AM - 12:45 PM ExHall A-F
We present MetaSpectra+, a compact multifunctional camera that supports two operating modes: (1) snapshot HDR + hyperspectral or (2) snapshot polarization + hyperspectral imaging. It utilizes a novel metasurface-refractive assembly that splits the incident beam into multiple channels and independently controls each channel’s dispersion, exposure, and polarization. Unlike prior multifunctional metasurface imagers restricted to narrow (10–100 nm) bands, MetaSpectra+ operates over nearly the entire visible spectrum (250 nm). Relative to snapshot hyperspectral imagers, it achieves the shortest total track length and the highest reconstruction accuracy on benchmark datasets. The demonstrated prototype reconstructs high-quality hyperspectral datacubes and either an HDR image or two orthogonal polarization channels from a snapshot measurement.
Poster

Molmo2: Open Weights and Data for Vision-Language Models with Video Understanding and Grounding

Christopher Clark ⋅ Jieyu Zhang ⋅ Zixian Ma ⋅ Jae Sung Park ⋅ Rohun Tripathi ⋅ Sangho Lee ⋅ Reza Salehi ⋅ Jason Ren ⋅ Chris Dongjoo Kim ⋅ Yinuo Yang ⋅ Vincent Shao ⋅ Yue Yang ⋅ Weikai Huang ⋅ Ziqi Gao ⋅ Taira Anderson ⋅ Jianrui Zhang ⋅ Jitesh Jain ⋅ George Stoica ⋅ Ali Farhadi ⋅ Ranjay Krishna
Jun 7, 11:45 AM - 1:45 PM ExHall F
Today’s strongest video-language models (VLMs) remain proprietary. The strongest open-weight models either rely on synthetic data from proprietary VLMs, effectively distilling from them, or do not disclose their training data or recipe. As a result, the open-source community lacks the foundations needed to improve on the state-of-the-art video (and image) language models. Crucially, many downstream applications require more than just high-level video understanding; they require grounding, either by pointing or by tracking in pixels. Even proprietary models lack this capability. We present Molmo2, a new family of VLMs that are state-of-the-art amongst open-source models and demonstrate exceptional new capabilities in point-driven grounding in single image, multi-image, and video tasks. Our key contribution is a collection of 7 new video datasets and 2 multi-image datasets, including a dataset of highly detailed video captions for pre-training, a free-form video Q&A dataset for fine-tuning, a new object tracking dataset with complex queries, and an innovative new video pointing dataset, all collected without the use of closed VLMs. We also present a training recipe for this data utilizing an efficient packing and message-tree encoding scheme and show that bi-directional attention on vision tokens and a novel token-weight strategy improve performance. Our best-in-class 8B model outperforms others in the class of open weight and data models on short videos, counting, and captioning, and is competitive on long videos. On video grounding, Molmo2 outperforms larger proprietary models, including 32.9% (Molmo2) vs 17% (Gemini 2.5 Pro) on video pointing.
Poster

Monocular Open Vocabulary Occupancy Prediction for Indoor Scenes

Changqing Zhou ⋅ Yueru Luo ⋅ Han Zhang ⋅ Zeyu Jiang ⋅ Changhao Chen
Jun 6, 4:45 PM - 6:45 PM ExHall A & F
Open-vocabulary 3D occupancy is vital for embodied agents, which need to understand complex indoor environments where semantic categories are abundant and evolve beyond fixed taxonomies. While recent work has explored open-vocabulary occupancy in outdoor driving scenarios, such methods transfer poorly indoors, where geometry is denser, layouts are more intricate, and semantics are far more fine-grained. To address these challenges, we adopt a geometry-only supervision paradigm that uses only binary occupancy labels (occupied vs. free). Our framework builds upon 3D Language-Embedded Gaussians, which serve as a unified intermediate representation coupling fine-grained 3D geometry with a language-aligned semantic embedding. On the geometry side, we find that existing Gaussian-to-Occupancy operators fail to converge under such weak supervision, and we introduce an opacity-aware, Poisson-based approach that stabilizes volumetric aggregation. On the semantic side, direct alignment between rendered features and open-vocabulary segmentation features suffers from feature mixing; we therefore propose a Progressive Temperature Decay schedule that gradually sharpens opacities during splatting, strengthening Gaussian–language alignment. On Occ-ScanNet, our framework achieves 59.50 IoU and 21.05 mIoU in the open-vocabulary setting, surpassing all existing occupancy methods in IoU and outperforming prior open-vocabulary approaches by a large margin in mIoU. Code will be released.
Poster

Native and Compact Structured Latents for 3D Generation

Jianfeng XIANG ⋅ Xiaoxue Chen ⋅ Sicheng Xu ⋅ Ruicheng Wang ⋅ Zelong Lv ⋅ Yu Deng ⋅ Hongyuan Zhu ⋅ Yue Dong ⋅ Hao Zhao ⋅ Nicholas Jing Yuan ⋅ Jiaolong Yang
Jun 6, 11:45 AM - 1:45 PM ExHall F
Recent advancements in 3D generative modeling have significantly improved the generation realism, yet the field is still hampered by existing representations, which struggle to capture assets with complex topologies and detailed appearance. This paper presents an approach for learning a structured latent representation from native 3D data to address this challenge. At its core is a new sparse voxel structure called O-Voxel, an omni-voxel representation that encodes both geometry and appearance. O-Voxel can robustly model arbitrary topology, including open, non-manifold, and fully-enclosed surfaces, while capturing comprehensive surface attributes beyond texture color, such as physically-based rendering parameters. Based on O-Voxel, we design a Sparse Compression VAE which provides a high spatial compression rate and a compact latent space. We train large-scale flow-matching models comprising 4B parameters for 3D generation using diverse public 3D asset datasets. Despite their scale, inference remains highly efficient. Meanwhile, the geometry and material quality of our generated assets far exceed those of existing models. We believe our approach offers a significant advancement in 3D generative modeling.
Poster

Natural Human Motion Recovery by Aligning High-Order Temporal Dynamics from Monocular Videos

Dingkun Wei ⋅ Zehong Shen ⋅ Yan Xia ⋅ Yujun Shen ⋅ Georgios Pavlakos ⋅ Xiaowei Zhou
Jun 5, 4:00 PM - 6:00 PM ExHall A & F
Human motion recovered from monocular videos often appears overly smooth or dynamically inconsistent, even when joint positions are numerically accurate. We observe that this limitation stems from the absence of reliable high-order temporal cues (velocity and acceleration), which are essential for reconstructing motion that exhibits realistic momentum, timing, and high-frequency detail. We introduce HTD-Refine, a post-processing framework that augments existing Human Motion Recovery (HMR) pipelines using explicitly estimated high-order temporal dynamics. At the core of our system is PVA-Net, a temporal transformer that infers per-joint 2D positions, velocities, and accelerations directly from a monocular video. These predicted dynamics serve as soft yet informative constraints in a global optimization procedure that refines camera-space and world-space trajectories, significantly reducing jitter, suppressing oversmoothing, and restoring physically plausible motion profiles. Extensive experiments on challenging in-the-wild benchmarks show that HTD-Refine consistently improves state-of-the-art HMR methods, yielding more accurate global trajectories and substantially more natural motion dynamics. Our results highlight the critical role of high-order temporal modeling in advancing monocular human motion recovery.
Poster

NitroGen: An Open Foundation Model for Generalist Gaming Agents

Loïc Magne ⋅ Anas Awadalla ⋅ Guanzhi Wang ⋅ Yinzhen Xu ⋅ Joshua Belofsky ⋅ Fengyuan Hu ⋅ Joohwan Kim ⋅ Ludwig Schmidt ⋅ Georgia Gkioxari ⋅ Jan Kautz ⋅ Yisong Yue ⋅ Yejin Choi ⋅ Yuke Zhu ⋅ Jim Fan
Jun 6, 4:45 PM - 6:45 PM ExHall A & F
We introduce NitroGen, a video-action foundation model for generalist gaming agents, trained on 40,000 hours of gameplay videos across more than 1000 games. We incorporate three key ingredients: 1) an internet-scale video-action dataset constructed by automatically extracting player actions from publicly available gameplay videos, 2) a multi-game benchmark environment that can measure cross-game generalization, and 3) a unified vision-action model trained with large-scale behavior cloning. NitroGen exhibits strong competence across diverse domains, including combat encounters in 3D action games, high-precision control in 2D platformers, and exploration in procedurally generated worlds. It transfers effectively to unseen games, achieving up to 52% relative improvement in success rates over models trained from scratch. We release the dataset, benchmark, and model weights to advance research on generalist embodied agents.
Poster

NuWa: Deriving Lightweight Class-Specific Vision Transformers for Edge Devices

Ziteng Wei ⋅ Qiang He ⋅ Bing Li ⋅ Feifei Chen ⋅ Hai Jin ⋅ Yun Yang
Jun 5, 10:45 AM - 12:45 PM ExHall A-F
Vision Transformers (ViTs) often need to be compressed for deployment on resource-constrained edge devices like drones and smart vehicles. However, existing model compression methods ignore that many edge devices only require the knowledge of specific classes for their applications. As a result, the derived all-class ViTs retain redundant knowledge and perform suboptimally on these classes. We discovered that simply replacing the calibration dataset with class-specific data does not suffice to address this issue, as these methods face two fundamental limitations. First, they overlook the existence of class-detrimental weights, which interfere with specialization, while removing them can improve class-specific performance. Second, the diversity of target classes and resource constraints on edge devices demand numerous customized models. Existing methods are time-consuming and computationally expensive, thus unscalable. In this work, we present NuWa, a cost-efficient method that addresses these challenges by deriving small ViTs from base ViTs for edge devices with specific class requirements. NuWa performs self-knowledge purification to prune class-detrimental weights and efficiently derives compact ViTs through closed-form optimization. Without post-pruning retraining, the derived edge ViTs surpass the base ViT in class-specific accuracy and accelerate inference. Comprehensive experiments demonstrate that NuWa outperforms state-of-the-art training-free pruning methods on class-specific tasks by up to 29.00% in accuracy. Compared with the best-performing training-dependent pruning method, NuWa achieves a 33.69× pruning speedup and reduces pruning cost by up to 99.83%, with only a 0.61% average accuracy loss.
Poster

OLATverse: A Large-scale Real-world Object Dataset with Precise Lighting Control

Xilong Zhou ⋅ Jianchun Chen ⋅ Pramod Rao ⋅ Timo Teufel ⋅ Linjie Lyu ⋅ Tigran Minasian ⋅ Oleksandr Sotnychenko ⋅ Xiaoxiao Long ⋅ Marc Habermann ⋅ Christian Theobalt
Jun 7, 11:45 AM - 1:45 PM ExHall F
We introduce OLATverse, a large-scale dataset comprising around 9M images of 765 real-world objects, captured from multiple viewpoints under a diverse set of precisely controlled lighting conditions. While recent advances in object-centric inverse rendering, novel view synthesis and relighting have shown promising results, most techniques still rely heavily on synthetic datasets for training and small-scale real-world datasets for benchmarking, which limits their realism and generalization. To address this gap, OLATverse offers two key advantages over existing datasets: large-scale coverage of real objects and high-fidelity appearance under precisely controlled illumination. Specifically, OLATverse contains 765 common and uncommon real-world objects, spanning a wide range of material categories. Each object is captured using 35 DSLR cameras and 331 individually controlled light sources, enabling the simulation of diverse illumination conditions. In addition, for each object, we provide well-calibrated camera parameters, accurate object masks, photometric surface normals, and diffuse albedo as auxiliary resources. We also construct an extensive evaluation set, establishing the first comprehensive real-world object-centric benchmark for inverse rendering and normal estimation. We believe that OLATverse represents a pivotal step toward integrating the next generation of inverse rendering and relighting methods with real-world data.
Poster

OpenDance: Multimodal Controllable 3D Dance Generation with Large-scale Internet Data

Jinlu Zhang ⋅ Zixi Kang ⋅ Libin Liu ⋅ Jianlong Chang ⋅ Qi Tian ⋅ Feng Gao ⋅ Yizhou Wang
Jun 7, 11:45 AM - 1:45 PM ExHall F
Music-driven 3D dance generation offers significant creative potential, yet practical applications demand versatile, multimodal control. Because dance is highly dynamic and complex human motion spanning various styles and genres, generating it requires satisfying diverse conditions beyond music alone (e.g., spatial trajectories, keyframe gestures, or style descriptions). However, the absence of a large-scale, richly annotated dataset severely hinders progress. In this paper, we build OpenDanceSet, an extensive human dance dataset comprising over 100 hours across 14 genres and 147 subjects. Each sample has rich annotations to facilitate robust cross-modal learning: 3D motion, paired music, 2D keypoints, trajectories, and expert-annotated text descriptions. Furthermore, we propose OpenDanceNet, a unified masked modeling framework for controllable dance generation, including a disentangled auto-encoder and a multimodal joint-prediction Transformer. OpenDanceNet supports generation conditioned on music and arbitrary combinations of text, keypoints, or trajectories. Comprehensive experiments demonstrate that our work achieves high-fidelity synthesis with strong diversity and realistic physical contacts, while also offering flexible control over spatial and stylistic conditions.
Poster

PAI-Bench: A Comprehensive Benchmark For Physical AI

Fengzhe Zhou ⋅ Jiannan Huang ⋅ Jialuo Li ⋅ Deva Ramanan ⋅ Humphrey Shi
Jun 6, 4:45 PM - 6:45 PM ExHall A & F
Physical AI aims to develop models that can perceive and predict real-world dynamics; yet, the extent to which current multi-modal large language models and video generative models support these abilities is insufficiently understood. We introduce Physical AI Bench (PAI-Bench), a unified and comprehensive benchmark that evaluates perception and prediction capabilities across video generation, conditional video generation, and video understanding, comprising 2,808 real-world cases with task-aligned metrics designed to capture physical plausibility and domain-specific reasoning. Our study provides a systematic assessment of recent models and shows that video generative models, despite strong visual fidelity, often struggle to maintain physically coherent dynamics, while multi-modal large language models exhibit limited performance in forecasting and causal interpretation. These observations suggest that current systems are still at an early stage in handling the perceptual and predictive demands of Physical AI. In summary, PAI-Bench establishes a realistic foundation for evaluating Physical AI and highlights key gaps that future systems must address.
Poster

PhyGaP: Physically-Grounded Gaussians with Polarization Cues

Jiale Wu ⋅ Xiaoyang Bai ⋅ Zongqi He ⋅ Weiwei Xu ⋅ YIFAN PENG
Jun 5, 4:00 PM - 6:00 PM ExHall A & F
Recent advances in 3D Gaussian Splatting (3DGS) have demonstrated great success in modeling reflective 3D objects and their interaction with the environment via **deferred rendering (DR)**. However, existing methods often struggle with correctly reconstructing physical attributes such as albedo and reflectance, and therefore they do not support high-fidelity relighting. Observing that this limitation stems from the lack of **shape and material** information in RGB images, we present PhyGaP, a physically-grounded 3DGS method that leverages polarization cues to facilitate precise reflection decomposition and visually consistent relighting of reconstructed objects. Specifically, we design a polarimetric deferred rendering (PolarDR) process to model polarization by reflection, and a self-occlusion-aware environment map building technique (GridMap) to resolve indirect lighting of non-convex objects. We validate on multiple synthetic and real-world scenes, including those featuring only partial polarization cues, that PhyGaP not only excels in reconstructing the appearance and surface normal of reflective 3D objects (~2 dB in PSNR and 45.7% in Cosine Distance better than existing RGB-based methods on average), but also achieves state-of-the-art inverse rendering and relighting capability.
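As background for the "polarization cues" mentioned above, here is a minimal sketch of the standard linear Stokes quantities computed from four polarizer-angle intensity measurements at one pixel. The formulas are textbook polarimetry; how PhyGaP actually consumes polarization data is not described in the abstract, so the function names and the four-angle capture setup are assumptions.

```python
import math
from typing import Tuple


def linear_stokes(i0: float, i45: float, i90: float, i135: float) -> Tuple[float, float, float]:
    """Linear Stokes components from intensities behind polarizers at 0, 45, 90, and 135 degrees."""
    s0 = 0.5 * (i0 + i45 + i90 + i135)  # total intensity
    s1 = i0 - i90                       # horizontal vs. vertical preference
    s2 = i45 - i135                     # diagonal preference
    return s0, s1, s2


def dolp_aolp(s0: float, s1: float, s2: float) -> Tuple[float, float]:
    """Degree and angle (radians) of linear polarization."""
    dolp = math.hypot(s1, s2) / s0 if s0 > 0 else 0.0
    aolp = 0.5 * math.atan2(s2, s1)
    return dolp, aolp


# Fully horizontally polarized light of unit intensity (Malus's law gives 0.5 at 45 and 135 degrees).
polarized = dolp_aolp(*linear_stokes(1.0, 0.5, 0.0, 0.5))
# Unpolarized light: every polarizer passes half the intensity.
unpolarized = dolp_aolp(*linear_stokes(0.5, 0.5, 0.5, 0.5))
```

Specular reflection tends to polarize light more strongly than diffuse scattering, which is the general reason such cues can help separate reflection components.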
Poster

PixelDiT: Pixel Diffusion Transformers for Image Generation

Yongsheng Yu ⋅ Wei Xiong ⋅ Weili Nie ⋅ Yichen Sheng ⋅ Shiqiu Liu ⋅ Jiebo Luo
Jun 6, 11:45 AM - 1:45 PM ExHall F
Latent-space modeling has been the standard for Diffusion Transformers (DiTs). However, it relies on a two-stage pipeline where the pretrained autoencoder introduces lossy reconstruction, leading to error accumulation while hindering joint optimization. To address these issues, we propose PixelDiT, a single-stage, end-to-end model that eliminates the need for the autoencoder and learns the diffusion process directly in the pixel space. PixelDiT adopts a fully transformer-based architecture shaped by a dual-level design: a patch-level DiT that captures global semantics and a pixel-level DiT that refines texture details, enabling efficient training of a pixel-space diffusion model while preserving fine details. PixelDiT achieves 1.61 FID on ImageNet 256 and 2.21 FID on ImageNet 512, surpassing existing pixel generative models by a large margin. We further extend PixelDiT to text-to-image generation and pretrain it at the $1024^{2}$ resolution in pixel space. It achieves 0.74 on GenEval and 83.5 on DPG-bench, approaching the best latent diffusion models.
Poster

Plant Taxonomy Meets Plant Counting: A Fine-Grained, Taxonomic Dataset for Counting Hundreds of Plant Species

Jinyu Xu ⋅ Tianqi Hu ⋅ Xiaonan Hu ⋅ Letian Zhou ⋅ Songliang Cao ⋅ Meng Zhang ⋅ Hao Lu
Jun 5, 10:45 AM - 12:45 PM ExHall A-F
Visually cataloging and quantifying the natural world requires pushing the boundaries of both detailed visual classification and counting at scale. Despite significant progress, particularly in crowd and traffic analysis, fine-grained, taxonomy-aware plant counting remains underexplored in vision. In contrast to crowds, plants are complicated by nonrigid morphologies and physical appearance variations across growth stages and environments. To fill this gap, we present TPC-268, the first plant counting benchmark taking plant taxonomy into account. Our dataset couples instance-level point annotations with complete Linnaean labels (kingdom$\rightarrow$species) and organ categories, enabling hierarchical reasoning and species-aware evaluation. The dataset features $10,000$ images with $678,090$ point annotations, includes $268$ countable plant categories over $242$ plant species in Plantae and Fungi, and spans observation scales from canopy-level remote sensing imagery to tissue-level microscopy. We follow the problem setting of class-agnostic counting (CAC), provide taxonomy-consistent, scale-aware data splits, and benchmark state-of-the-art regression- and detection-based CAC approaches. By capturing the biodiversity, hierarchical structure, and multi-scale nature of botanical and mycological taxa, TPC-268 provides a biologically grounded testbed to advance fine-grained class-agnostic counting.
Poster

ProPhy: Progressive Physical Alignment for Dynamic World Simulation

Zijun Wang ⋅ Panwen Hu ⋅ Jing Wang ⋅ Terry Jingchen Zhang ⋅ Yuhao Cheng ⋅ Long Chen ⋅ Yiqiang Yan ⋅ Zutao Jiang ⋅ Hanhui Li ⋅ Xiaodan Liang
Jun 6, 11:45 AM - 1:45 PM ExHall F
Recent advances in video generation have shown remarkable potential for constructing world simulators. However, current models still struggle to produce physically consistent results, particularly when handling large-scale or complex dynamics. This limitation arises primarily because existing approaches respond isotropically to physical prompts and neglect the fine-grained alignment between generated content and localized physical cues. To address these challenges, we propose ProPhy, a Progressive Physical Alignment Framework that enables explicit physics-aware conditioning and anisotropic generation. ProPhy employs a two-stage Mixture-of-Physics-Experts (MoPE) mechanism for discriminative physical prior extraction, where Semantic Experts infer semantic-level physical principles from textual descriptions, and Refinement Experts capture token-level physical dynamics. This mechanism allows the model to learn fine-grained, physics-aware video representations that better reflect underlying physical laws. Furthermore, we introduce a physical alignment strategy that transfers the physical reasoning capabilities of vision-language models (VLMs) into the Refinement Experts, facilitating a more accurate representation of dynamic physical phenomena. Extensive experiments on physics-aware video generation benchmarks demonstrate that ProPhy produces more realistic, dynamic, and physically coherent results than existing state-of-the-art methods.
Poster

Proxy-GS: Unified Occlusion Priors for Training and Inference in Structured 3D Gaussian Splatting

Yuanyuan Gao ⋅ YUNING GONG ⋅ Yifei Liu ⋅ Jingfeng Li ⋅ Dan Xu ⋅ Yanci Zhang ⋅ Dingwen Zhang ⋅ Xiao Sun ⋅ Zhihang Zhong
Jun 5, 4:00 PM - 6:00 PM ExHall A & F
3D Gaussian Splatting (3DGS) has emerged as an efficient approach for achieving photorealistic rendering. Recent MLP-based variants further improve visual fidelity but introduce substantial decoding overhead during rendering. To alleviate computation cost, several pruning strategies and level-of-detail (LOD) techniques have been introduced, aiming to effectively reduce the number of Gaussian primitives in large-scale scenes. However, our analysis reveals that significant redundancy still remains due to the lack of occlusion awareness. In this work, we propose Proxy-GS, a novel pipeline that exploits a proxy to introduce Gaussian occlusion awareness from any view. At the core of our approach is a fast proxy system capable of producing precise occlusion depth maps at a resolution of 1000$\times$1000 in under 1 ms. This proxy serves two roles. First, it guides the culling of anchors and Gaussians to accelerate rendering. Second, it guides densification towards surfaces during training, avoiding inconsistencies in occluded regions and improving rendering quality. In heavily occluded scenarios, such as the MatrixCity Streets dataset, Proxy-GS not only equips MLP-based Gaussian splatting with stronger rendering capability but also achieves faster rendering speed than the original 3DGS. Specifically, it achieves more than $2.5\times$ speedup over Octree-GS, and consistently delivers substantially higher rendering quality.
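The culling role of the proxy can be illustrated with a generic depth-test sketch: an anchor is kept only if it lies no farther from the camera than the proxy depth at its pixel, up to a tolerance. This is a simplified assumption about how depth-map-based occlusion culling typically works, not the Proxy-GS implementation; the function `cull_occluded`, the tolerance `eps`, and the pre-projected anchor format are all hypothetical.

```python
import math
from typing import List, Sequence, Tuple


def cull_occluded(
    anchors: Sequence[Tuple[float, float, float]],
    depth_map: Sequence[Sequence[float]],
    eps: float = 0.05,
) -> List[int]:
    """Return indices of anchors that pass the occlusion depth test.

    anchors: (u, v, z) tuples already projected to pixel coordinates, z = camera-space depth.
    depth_map: proxy depth per pixel, indexed as depth_map[row][col].
    eps: slack so anchors sitting on the proxy surface are not culled (illustrative value).
    """
    height, width = len(depth_map), len(depth_map[0])
    visible: List[int] = []
    for idx, (u, v, z) in enumerate(anchors):
        col, row = math.floor(u), math.floor(v)
        if z <= 0 or not (0 <= col < width and 0 <= row < height):
            continue  # behind the camera or outside the image
        if z <= depth_map[row][col] + eps:
            visible.append(idx)  # at or in front of the proxy surface
    return visible


depth_map = [[1.0, 5.0], [2.0, 2.0]]
anchors = [(0.2, 0.3, 0.9), (0.5, 0.5, 3.0), (1.5, 0.1, 4.0), (3.0, 0.0, 1.0)]
kept = cull_occluded(anchors, depth_map)
```

Here the second anchor is hidden behind the proxy surface and the fourth projects outside the image, so only the first and third survive.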
Poster

QuadSync: Quadrifocal Tensor Synchronization via Tucker Decomposition

Daniel Miao ⋅ Gilad Lerman ⋅ Joe Kileel
Jun 7, 11:45 AM - 1:45 PM ExHall F
In structure from motion, quadrifocal tensors capture more information than their pairwise counterparts (essential matrices), yet they have often been thought of as impractical and only of theoretical interest. In this work, we challenge such beliefs by providing a new framework to recover $n$ cameras from the corresponding collection of quadrifocal tensors. We form the block quadrifocal tensor and show that it admits a Tucker decomposition whose factor matrices are the stacked camera matrices, and which thus has a multilinear rank of (4,4,4,4) independent of $n$. We develop the first synchronization algorithm for quadrifocal tensors, using Tucker decomposition, alternating direction method of multipliers, and iteratively reweighted least squares. We further establish relationships between the block quadrifocal, trifocal, and bifocal tensors, and introduce an algorithm that jointly synchronizes these three entities. Numerical experiments demonstrate the effectiveness of our methods on modern datasets, indicating the potential and importance of using higher-order information in synchronization.
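The Tucker structure can be checked on a toy example, assuming the standard determinant definition of quadrifocal entries (each entry is the determinant of four camera rows, one per view). Under that assumption, every entry of the block tensor is a contraction of the 4×4×4×4 Levi-Civita tensor with rows of the stacked camera matrix, which is consistent with a multilinear rank of (4,4,4,4) regardless of the number of cameras. The paper's exact sign and index conventions may differ, and the names below are hypothetical.

```python
from itertools import permutations
from typing import List, Sequence


def perm_sign(p: Sequence[int]) -> int:
    """Sign of a permutation of 0..n-1, computed by sorting it with swaps."""
    p = list(p)
    sign = 1
    for i in range(len(p)):
        while p[i] != i:
            j = p[i]
            p[i], p[j] = p[j], p[i]
            sign = -sign
    return sign


def block_entry(C: List[List[int]], a: int, b: int, c: int, d: int) -> int:
    """Core (Levi-Civita) contracted with rows a, b, c, d of the stacked camera matrix C."""
    return sum(
        perm_sign(p) * C[a][p[0]] * C[b][p[1]] * C[c][p[2]] * C[d][p[3]]
        for p in permutations(range(4))
    )


def det(m: List[List[int]]) -> int:
    """Cofactor-expansion determinant, used as an independent check."""
    if len(m) == 1:
        return m[0][0]
    return sum(
        (-1) ** col * m[0][col] * det([row[:col] + row[col + 1:] for row in m[1:]])
        for col in range(len(m))
    )


# Toy stack of two 3x4 cameras: [I | 0] and [I | t] with t = (1, 2, 3).
C = [
    [1, 0, 0, 0], [0, 1, 0, 0], [0, 0, 1, 0],
    [1, 0, 0, 1], [0, 1, 0, 2], [0, 0, 1, 3],
]
entry = block_entry(C, 0, 1, 2, 5)
```

Because the factor matrix has only four columns, each mode unfolding of the block tensor has rank at most four, however many cameras are stacked.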
Poster

R^2-Seg: Training-Free OOD Medical Tumor Segmentation via Anatomical Reasoning and Statistical Rejection

Shuaike Shen ⋅ Ke Liu ⋅ Jiaqing Xie ⋅ Shangde Gao ⋅ Chunhua Shen ⋅ Ge Liu ⋅ Mireia Crispin-Ortuzar ⋅ Shangqi Gao
Jun 6, 4:45 PM - 6:45 PM ExHall A & F
Foundation models for medical image segmentation struggle under out-of-distribution (OOD) shifts, often producing fragmented false positives on OOD tumors. We introduce **R$^2$-Seg**, a **training-free** framework for robust OOD tumor segmentation that operates via a two-stage **Reason-and-Reject** process. First, the **Reason** step employs an LLM-guided anatomical reasoning planner to localize organ anchors and generate multi-scale ROIs. Second, the **Reject** step applies two-sample statistical testing to candidates generated by a frozen foundation model (BiomedParse) within these ROIs. This statistical rejection filter retains only candidates significantly different from normal tissue, effectively suppressing false positives. Our framework requires no parameter updates, making it compatible with zero-update test-time augmentation and avoiding catastrophic forgetting. On multi-center and multi-modal tumor segmentation benchmarks, **R$^2$-Seg** substantially improves Dice, specificity, and sensitivity over strong baselines and the original foundation models.
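The Reject step can be pictured with a generic two-sample test. The abstract does not say which test R$^2$-Seg uses, so the sketch below substitutes a simple permutation test on the difference of mean intensities; the function names, the significance level `alpha`, and the toy intensity values are all assumptions for illustration.

```python
import random
from statistics import mean
from typing import Sequence


def permutation_pvalue(
    sample_a: Sequence[float], sample_b: Sequence[float], n_perm: int = 2000, seed: int = 0
) -> float:
    """Two-sided permutation p-value for a difference in means (stand-in test, not the paper's)."""
    rng = random.Random(seed)
    observed = abs(mean(sample_a) - mean(sample_b))
    pooled = list(sample_a) + list(sample_b)
    n_a = len(sample_a)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(pooled)
        if abs(mean(pooled[:n_a]) - mean(pooled[n_a:])) >= observed:
            hits += 1
    # Add-one correction keeps the estimate strictly positive.
    return (hits + 1) / (n_perm + 1)


def keep_candidate(candidate_px: Sequence[float], normal_px: Sequence[float], alpha: float = 0.05) -> bool:
    """Retain a candidate region only if it differs significantly from normal tissue."""
    return permutation_pvalue(candidate_px, normal_px) < alpha


normal = [0.50, 0.52, 0.48, 0.51, 0.49, 0.50, 0.53, 0.47]
tumor_like = [0.80, 0.82, 0.79, 0.81, 0.83, 0.78, 0.80, 0.82]
tissue_like = [0.51, 0.49, 0.50, 0.52, 0.48, 0.50, 0.47, 0.53]
```

With these toy values, `keep_candidate(tumor_like, normal)` retains the candidate, while `keep_candidate(tissue_like, normal)` rejects it as a likely false positive.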
Poster

RefAV: Towards Planning-Centric Scenario Mining

Cainan Davidson ⋅ Deva Ramanan ⋅ Neehar Peri
Jun 6, 4:45 PM - 6:45 PM ExHall A & F
Autonomous Vehicles (AVs) collect and pseudo-label terabytes of multi-modal data localized to HD maps during normal fleet testing. However, identifying interesting and safety-critical scenarios from uncurated driving logs remains a significant challenge. Traditional scenario mining techniques are error-prone and prohibitively time-consuming, often relying on hand-crafted structured queries. In this work, we revisit spatio-temporal scenario mining through the lens of recent vision-language models (VLMs) to detect whether a described scenario occurs in a driving log and, if so, precisely localize it in both time and space. To address this problem, we introduce RefAV, a large-scale dataset of $10,000$ diverse natural language queries that describe complex multi-agent interactions relevant to motion planning derived from $1000$ driving logs in the Argoverse 2 Sensor dataset. We evaluate several referential multi-object trackers and present an empirical analysis of our baselines. Notably, we find that naively repurposing off-the-shelf VLMs yields poor performance, suggesting that scenario mining presents unique challenges. Lastly, we discuss our recently held competition and share insights from the community.
Poster

Relightable Holoported Characters: Capturing and Relighting Dynamic Human Performance from Sparse Views

Kunwar Maheep Singh ⋅ Jianchun Chen ⋅ Vladislav Golyanik ⋅ Stephan Garbin ⋅ Thabo Beeler ⋅ Rishabh Dabral ⋅ Marc Habermann ⋅ Christian Theobalt
Jun 7, 11:45 AM - 1:45 PM ExHall F
We present _Relightable Holoported Characters_ (RHC), a novel person-specific method for free-view rendering and relighting of full-body and highly dynamic humans solely observed from sparse-view RGB videos at inference. In contrast to classical one-light-at-a-time (OLAT)-based human relighting, our transformer-based RelightNet predicts relit appearance within a single network pass, avoiding costly OLAT-basis capture and generation. For training such a model, we introduce a new capture strategy and dataset recorded in a multi-view lightstage, where we alternate frames lit by random environment maps with uniformly lit tracking frames, simultaneously enabling accurate motion tracking and diverse illumination as well as dynamics coverage. Inspired by the rendering equation, we derive physics-informed features that encode geometry, albedo, shading, and the virtual camera view from a coarse human mesh proxy and the input views. Our RelightNet then takes these features as input and cross-attends them with a novel lighting condition, and regresses the relit appearance in the form of texel-aligned 3D Gaussian splats attached to the coarse mesh proxy. Consequently, our RelightNet implicitly learns to efficiently compute the rendering equation for novel lighting conditions within a single feed-forward pass. Experiments demonstrate our method’s superior visual fidelity and lighting reproduction compared to state-of-the-art approaches.
Poster

Residual Primitive Fitting of 3D Shapes with SuperFrusta

Aditya Ganeshan ⋅ Matheus Gadelha ⋅ Thibault Groueix ⋅ Zhiqin Chen ⋅ Siddhartha Chaudhuri ⋅ Vladimir G. Kim ⋅ Wang Yifan ⋅ Daniel Ritchie
Jun 5, 4:00 PM - 6:00 PM ExHall A & F
We introduce a framework for converting 3D shapes into compact and editable assemblies of analytic primitives, directly addressing the persistent trade-off between reconstruction fidelity and parsimony. Our approach combines two key contributions: a novel primitive, termed SuperFrustum, and an iterative inference algorithm, Residual Primitive Fitting (ResFit). SuperFrustum is an analytical primitive that is simultaneously (1) expressive, being able to express various common solids such as cylinders, spheres, cones, and their tapered and bent forms, (2) editable, being compactly parameterized with 8 parameters, and (3) optimizable, with a signed distance field differentiable w.r.t. its parameters almost everywhere. ResFit is an unsupervised procedure that interleaves global shape analysis with local optimization, iteratively fitting primitives to the unexplained residual of a shape to discover a parsimonious yet accurate decomposition for each input shape. On diverse 3D benchmarks, our method achieves state-of-the-art results, improving IoU by over 9 points while using nearly half as many primitives as prior work. The resulting assemblies bridge the gap between dense 3D data and human-controllable design, producing high-fidelity and editable shape programs.
Poster

Rethinking Dataset Distillation: Hard Truths about Soft Labels

Priyam Dey ⋅ Aditya Sahdev ⋅ Sunny Bhati ⋅ Konda Reddy Mopuri ⋅ R. Venkatesh Babu
Jun 5, 10:45 AM - 12:45 PM ExHall A-F
Despite the perceived success of large-scale dataset distillation (DD) methods, recent evidence \cite{qin2024a} finds that simple random image baselines perform on par with state-of-the-art DD methods like SRe2L \cite{yin2024squeezerecoverrelabeldataset} due to the use of soft labels during downstream model training. This is in contrast with the findings in coreset literature, where high-quality coresets consistently outperform random subsets in the hard-label (HL) setting. To understand this discrepancy, we perform a detailed scalability analysis to examine the role of data quality under different label regimes, ranging from abundant soft labels (termed the SL+KD regime) to fixed soft labels (SL) and hard labels (HL). Our analysis reveals that high-quality coresets fail to convincingly outperform the random baseline in both SL and SL+KD regimes. In the SL+KD setting, performance further approaches near-optimal levels relative to the full dataset, regardless of subset size or quality, for a given compute budget. This performance saturation calls into question the widespread practice of using soft labels for model evaluation, where, unlike in the HL setting, subset quality has negligible influence. A subsequent systematic evaluation of five large-scale and four small-scale DD methods in the HL setting reveals that only RDED \cite{sun2024diversityrealismdistilleddataset} reliably outperforms random baselines on ImageNet-1K, but can still lag behind strong coreset methods due to its over-reliance on easy sample patches. Based on this, we introduce CAD-Prune, a compute-aware pruning metric that efficiently identifies samples of optimal difficulty for a given compute budget, and use it to develop CA2D, a compute-aligned DD method, outperforming current DD methods on ImageNet-1K at various IPC settings. Together, our findings uncover many insights into current DD research and establish useful tools to advance data-efficient learning for both coresets and DD.
Poster

Revisiting Geometric Obfuscation with Dual Convergent Lines for Privacy-Preserving Image Queries in Visual Localization

Jeonggon Kim ⋅ Heejoon Moon ⋅ Je Hyeong Hong
Jun 5, 10:45 AM - 12:45 PM ExHall A-F
Privacy-Preserving Image Queries (PPIQ) are an emerging mechanism for cloud-based visual localization, enabling pose estimation from obfuscated features instead of private images or raw keypoints. However, the main approaches for PPIQ, primarily geometry-based and segmentation-based obfuscation, both suffer from vulnerabilities to recent privacy attacks. In particular, a fundamental limitation of geometry-based obfuscation is that the spatial distribution of obfuscated neighboring lines still effectively surrounds the original keypoint location, providing exploitable cues for recovering the original points. We revisit this geometric paradigm and introduce Dual Convergent Lines (DCL), a novel keypoint obfuscation method demonstrating strong resilience against such attacks. DCL places two fixed anchors on a central partition line and lifts each keypoint to a line originating from one of them, with the active anchor determined by the keypoint's location. This arrangement invalidates the geometry-recovery attack by making its optimization ill-posed: neighboring lines either misleadingly converge to one anchor, yielding a trivial solution, or become near-parallel at the partition boundary, yielding an unstable high-variance solution. Both outcomes thwart point recovery. DCL is also compatible with an existing line-based solver, enabling deployment in traditional localization pipelines. Experiments on both indoor and large-scale outdoor datasets demonstrate DCL's robustness against privacy attacks, efficiency, and scalability, while achieving practical localization performance.
Poster

SAM 3D: 3Dfy Anything in Images

Xingyu Chen ⋅ Fu-Jen Chu ⋅ Pierre Gleize ⋅ Kevin J Liang ⋅ Alexander Sax ⋅ Hao Tang ⋅ Weiyao Wang ⋅ Michelle Guo ⋅ Thibaut Hardin ⋅ Xiang Li ⋅ Aohan Lin ⋅ Jia-Wei Liu ⋅ Ziqi Ma ⋅ Anushka Sagar ⋅ Bowen Song ⋅ Xiaodong Wang ⋅ Jianing "Jed" Yang ⋅ Bowen Zhang ⋅ Piotr Dollár ⋅ Georgia Gkioxari ⋅ Matt Feiszli ⋅ Jitendra Malik
Jun 5, 4:00 PM - 6:00 PM ExHall A & F
We present SAM 3D, a generative model for visually grounded 3D object reconstruction, predicting geometry, texture, and layout from a single image. SAM 3D excels in natural images, where occlusion and scene clutter are common and visual recognition cues from context play a larger role. We achieve this with a human- and model-in-the-loop pipeline for annotating object shape, texture, and pose, providing visually grounded 3D reconstruction data at unprecedented scale. We learn from this data in a modern, multi-stage training framework that combines synthetic pretraining with real-world alignment, breaking the 3D "data barrier". We obtain significant gains over recent work, with at least a $5:1$ win rate in human preference tests on real-world objects and scenes. We will release our code and model weights, an online demo, and a new challenging benchmark for in-the-wild 3D object reconstruction.
Poster

SAM 3D Body: Robust Full-Body Human Mesh Recovery

Xitong Yang ⋅ Devansh Kukreja ⋅ Don Pinkus ⋅ Taosha Fan ⋅ Jinhyung Park ⋅ Soyong Shin ⋅ Jinkun Cao ⋅ Jia-Wei Liu ⋅ Nicolás Ugrinovic ⋅ Anushka Sagar ⋅ Jitendra Malik ⋅ Matt Feiszli ⋅ Piotr Dollár ⋅ Kris Kitani
Jun 5, 4:00 PM - 6:00 PM ExHall A & F
We introduce SAM 3D Body (3DB), a promptable model for single-image full-body 3D human mesh recovery (HMR) that demonstrates state-of-the-art performance, with strong generalization and consistent accuracy in diverse in-the-wild conditions. 3DB estimates the human pose of the body, feet, and hands. It is the first model to use a new parametric mesh representation, Momentum Human Rig (MHR), which decouples skeletal pose and body shape. 3DB employs an encoder–decoder architecture and supports auxiliary prompts, including 2D keypoints and masks, enabling user-guided inference similar to the SAM family of models. We derive high-quality annotations from a multi-stage annotation pipeline that uses various combinations of manual keypoint annotation, differentiable optimization, multi-view geometry, and dense keypoint detection. Our data engine efficiently selects and processes data to ensure data diversity, collecting unusual poses and rare imaging conditions. We present a new evaluation dataset organized by pose and appearance categories, enabling nuanced analysis of model behavior. Our experiments demonstrate superior generalization and substantial improvements over prior methods in both qualitative user preference studies and traditional quantitative analysis. Both 3DB and MHR are open-source.
Poster

SeaCache: Spectral-Evolution-Aware Cache for Accelerating Diffusion Models

Jiwoo Chung ⋅ Sangeek Hyun ⋅ MinKyu Lee ⋅ Byeongju Han ⋅ Geonho Cha ⋅ Dongyoon Wee ⋅ Youngjun Hong ⋅ Jae-Pil Heo
Jun 6, 11:45 AM - 1:45 PM ExHall F
Diffusion models are a strong backbone for visual generation, but their inherently sequential denoising process leads to slow inference. Previous methods accelerate sampling by caching and reusing intermediate outputs based on feature distances between adjacent timesteps. However, existing caching strategies typically rely on raw feature differences that entangle content and noise. This design overlooks spectral evolution, where low-frequency structure appears early and high-frequency detail is refined later. We introduce Spectral-Evolution-Aware Cache (SeaCache), a training-free cache schedule that bases reuse decisions on a spectrally aligned representation. Through theoretical and empirical analysis, we derive a Spectral-Evolution-Aware (SEA) filter that preserves content-relevant components while suppressing noise. Employing SEA-filtered input features to estimate redundancy leads to dynamic schedules that adapt to content while respecting the spectral priors of the underlying diffusion model. Extensive experiments on diverse visual generative models and the baselines show that SeaCache achieves state-of-the-art latency-quality trade-offs.
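The general shape of a distance-based cache schedule can be sketched as follows: accumulate the relative change between filtered features of adjacent timesteps and rerun the full model only when the accumulated change crosses a threshold. The SEA filter itself is derived in the paper and is not specified in the abstract, so a plain moving-average low-pass filter stands in for it here; `low_pass`, `rel_l1`, `plan_schedule`, and the threshold value are hypothetical.

```python
from typing import Callable, List, Sequence


def low_pass(x: Sequence[float], k: int = 3) -> List[float]:
    """Moving-average filter; a placeholder for the paper's SEA filter."""
    half = k // 2
    out: List[float] = []
    for i in range(len(x)):
        window = x[max(0, i - half): i + half + 1]
        out.append(sum(window) / len(window))
    return out


def rel_l1(cur: Sequence[float], prev: Sequence[float]) -> float:
    """Relative L1 distance between two feature vectors."""
    num = sum(abs(c - p) for c, p in zip(cur, prev))
    return num / (sum(abs(p) for p in prev) + 1e-8)


def plan_schedule(
    features: Sequence[Sequence[float]],
    threshold: float,
    filt: Callable[[Sequence[float]], List[float]] = low_pass,
) -> List[bool]:
    """Per-timestep flags: True = run the full model, False = reuse the cached output."""
    flags = [True]  # the first step always computes
    accumulated = 0.0
    prev = filt(features[0])
    for feat in features[1:]:
        cur = filt(feat)
        accumulated += rel_l1(cur, prev)
        if accumulated >= threshold:
            flags.append(True)
            accumulated = 0.0  # reset after refreshing the cache
        else:
            flags.append(False)
        prev = cur
    return flags


static = plan_schedule([[1.0, 1.0, 1.0, 1.0]] * 3, threshold=0.1)
changing = plan_schedule([[1.0] * 4, [2.0] * 4, [4.0] * 4], threshold=0.1)
```

Swapping the `filt` argument is where a spectrally aligned representation would plug in; the surrounding accumulate-and-threshold logic is a common pattern in caching methods and is only assumed here.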
Poster

SeeGroup: Multi-Layer Depth Estimation of Transparent Surfaces via Self-Determined Grouping

Hongyu Wen ⋅ Jia Deng
Jun 5, 4:00 PM - 6:00 PM ExHall A & F
Transparent objects are common in daily life, and understanding their multi-layer depth information, including both the transparent surface and the objects behind it, is crucial for real-world applications that interact with transparent materials. However, existing depth methods produce only a single depth map, which is inherently ambiguous for transparent surfaces. In this work, we propose SeeGroup, a multi-layer depth estimation method consisting of a novel recurrent decomposition module and an intensity-based formulation for multi-layer depth. Experiments demonstrate that our method significantly improves the state of the art in multi-layer depth estimation, raising quadruplet relative depth accuracy on the LayeredDepth benchmark from 61.34% to 70.67%.
Poster

SmokeSVD: Smoke Reconstruction from A Single View via Progressive Novel View Synthesis and Refinement with Diffusion Models

Chen Li ⋅ Shanshan Dong ⋅ Sheng Qiu ⋅ Jianmin Han ⋅ Yibo Zhao ⋅ Zan Gao ⋅ Taku Komura ⋅ Kemeng Huang
Jun 5, 4:00 PM - 6:00 PM ExHall A & F
Reconstructing dynamic fluids from sparse views is a long-standing and challenging problem, due to the severe lack of 3D information from insufficient view coverage. While several pioneering approaches have attempted to address this issue using differentiable rendering or novel view synthesis, they are often limited by time-consuming optimization under ill-posed conditions. We propose SmokeSVD, an efficient and effective framework to progressively reconstruct dynamic smoke from a single video by integrating the generative capabilities of diffusion models with physically guided consistency optimization. Specifically, we first propose a physically guided side-view synthesizer based on diffusion models, which explicitly incorporates velocity field constraints to generate spatio-temporally consistent side-view images frame by frame, significantly alleviating the ill-posedness of single-view reconstruction. Subsequently, we iteratively refine novel-view images and reconstruct 3D density fields through a progressive multi-stage process that renders and enhances images from increasing viewing angles, generating high-quality multi-view sequences. Finally, we estimate fine-grained density and velocity fields via differentiable advection by leveraging the Navier-Stokes equations. Our approach supports re-simulation and downstream applications while achieving superior reconstruction quality and computational efficiency compared to state-of-the-art methods.
Poster

SoccerMaster: A Vision Foundation Model for Soccer Understanding

Haolin Yang ⋅ Jiayuan Rao ⋅ Haoning Wu ⋅ Weidi Xie
Jun 6, 4:45 PM - 6:45 PM ExHall A & F
Soccer understanding has recently garnered growing research interest due to its domain-specific complexity and unique challenges. However, prior works typically rely on task-specific expert models, which are resource-intensive and hinder a holistic view of the game. This paper proposes a unified framework that enables a single model to handle diverse soccer visual understanding tasks, spanning both fine-grained perception (e.g., athlete detection) and semantic reasoning (e.g., event classification). Concretely, we make the following contributions: (i) we present **SoccerMaster**, the first soccer-specific vision foundation model that unifies comprehensive understanding tasks within a single framework via **supervised multi-task pretraining**; (ii) we consolidate multiple existing soccer video datasets and develop an automated data curation pipeline, termed **SoccerFactory**, to produce scalable multi-task training annotations; and (iii) we conduct extensive experiments demonstrating that SoccerMaster consistently outperforms task-specific expert models across diverse downstream tasks, underscoring its breadth and superiority. The data, code, and model will be publicly available to the research community.
Poster

SocialNav: Training Human-Inspired Foundation Model for Socially-Aware Embodied Navigation

Ziyi Chen ⋅ Yingnan Guo ⋅ Zedong Chu ⋅ Minghua Luo ⋅ Yanfen Shen ⋅ Mingchao Sun ⋅ Junjun Hu ⋅ Shichao Xie ⋅ Yang Kuan ⋅ Pei Shi ⋅ Zhining Gu ⋅ Lu Liu ⋅ Honglin Han ⋅ Xiaolong Wu ⋅ Mu Xu ⋅ Yu Zhang
Jun 7, 11:45 AM - 1:45 PM ExHall F
Embodied navigation that adheres to social norms remains an open research challenge. Our SocialNav is a foundational model for socially-aware navigation with a hierarchical "brain-action" architecture, capable of understanding high-level social norms and generating low-level, socially compliant trajectories. To enable such dual capabilities, we construct the SocNav Dataset, a large-scale collection of 7 million samples, comprising (1) a Cognitive Activation Dataset providing social reasoning signals such as chain-of-thought explanations and social traversability prediction, and (2) an Expert Trajectories Pyramid aggregating diverse navigation demonstrations from internet videos, simulated environments, and real-world robots. A multi-stage training pipeline is proposed to gradually inject and refine navigation intelligence: we first inject general navigation skills and social norms understanding into the model via imitation learning, and then refine such skills through a deliberately designed Socially-Aware FlowExploration GRPO (SAFE-GRPO), the first flow-based reinforcement learning framework for embodied navigation that explicitly rewards socially compliant behaviors. SocialNav achieves +38% success rate and +46% social compliance rate compared to the state-of-the-art method, demonstrating strong gains in both navigation performance and social compliance. Data and code will be made publicly available.
Poster

SparseWorld-TC: Trajectory-Conditioned Sparse Occupancy World Model

Jiayuan Du ⋅ Yiming Zhao ⋅ Zhenglong Guo ⋅ Yong Pan ⋅ Wenbo Hou ⋅ Zhihui Hao ⋅ Kun Zhan ⋅ Qijun Chen
Jun 5, 4:00 PM - 6:00 PM ExHall A & F
This paper introduces a novel architecture for trajectory-conditioned forecasting of future 3D scene occupancy. In contrast to methods that rely on variational autoencoders (VAEs) to generate discrete occupancy tokens, which inherently limit representational capacity, our approach predicts multi-frame future occupancy in an end-to-end manner directly from raw image features. Inspired by the success of attention-based transformer architectures in foundational vision and language models such as GPT and VGGT, we employ a sparse occupancy representation that bypasses the intermediate bird’s eye view (BEV) projection and its explicit geometric priors. This design allows the transformer to capture spatiotemporal dependencies more effectively. By avoiding both the finite-capacity constraint of discrete tokenization and the structural limitations of BEV representations, our method achieves state-of-the-art performance on the nuScenes benchmark for 1‒3 second occupancy forecasting, outperforming existing approaches by a significant margin. Furthermore, it demonstrates robust scene dynamics understanding, consistently delivering high accuracy under arbitrary future trajectory conditioning.
Poster

Texvent: Asynchronous Event Data Simulation via Text Prompt

Ruofei Wang ⋅ Peiqi Duan ⋅ Ka Chun Cheung ⋅ Simon See ⋅ Boxin Shi ⋅ Renjie Wan
Jun 7, 3:30 PM - 5:30 PM ExHall A
Current event simulation methods focus on employing videos to synthesize new event data, suffering from costly video capture and limited scalability across viewpoints, motions, and lighting. To this end, we propose a text-to-event simulation framework (Texvent) that can directly generate asynchronous event data from simple text prompts. Texvent first renders prompt-driven videos via multimodal large language models and subsequently applies a new physical simulator to generate event streams. Specifically, an adaptive brightness-aware frame interpolation approach is proposed to enhance the temporal resolution of the rendered videos. A balanced logarithmic intensity comparison strategy and a cache-based voltage refreshment mechanism are introduced into the simulator to generate event data. To narrow the sim-to-real gap, we also introduce background activity noise injection and dense timestamp reconstruction operations. Extensive experiments demonstrate Texvent’s superior computational efficiency and its ability to generate more realistic event data than existing simulators.
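For readers unfamiliar with event simulation, the sketch below shows the generic logarithmic intensity comparison that video-to-event simulators commonly build on: a pixel fires an event when its log intensity departs from a stored reference by at least a contrast threshold. It is not Texvent's balanced comparison strategy or voltage refreshment mechanism, which the abstract does not specify. The function name `frames_to_events`, the `threshold` and `eps` values, and the one-event-per-pixel-per-frame simplification are all assumptions made for illustration.

```python
import math

def frames_to_events(frames, timestamps, threshold=0.2, eps=1e-3):
    """Convert grayscale frames (lists of rows, values in [0, 1]) to events.

    Returns a list of (t, x, y, polarity) tuples. Hypothetical helper for
    illustration only; it emits at most one event per pixel per frame.
    """
    # Per-pixel reference log intensity, initialised from the first frame.
    log_ref = [[math.log(v + eps) for v in row] for row in frames[0]]
    events = []
    for frame, t in zip(frames[1:], timestamps[1:]):
        for y, row in enumerate(frame):
            for x, v in enumerate(row):
                log_cur = math.log(v + eps)
                diff = log_cur - log_ref[y][x]
                if abs(diff) >= threshold:
                    # Polarity +1 for brightening, -1 for darkening.
                    events.append((t, x, y, 1 if diff > 0 else -1))
                    # Reset the reference only at pixels that fired.
                    log_ref[y][x] = log_cur
    return events
```

A real simulator would also interpolate event timestamps between frames and emit several events for large brightness changes; both are omitted here to keep the core comparison visible.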
Poster

The SA-FARI Dataset: Segment Anything in Footage of Animals for Recognition and Identification

Dante Wasmuht ⋅ Otto Brookes ⋅ Maximilian Schall ⋅ Pablo Palencia ⋅ Christopher Beirne ⋅ Tilo Burghardt ⋅ Majid Mirmehdi ⋅ Hjalmar Kühl ⋅ Mimi Arandjelovic ⋅ Sam Pottie ⋅ Peter Bermant ⋅ Brandon Asheim ⋅ Yi Jin Toh ⋅ Adam Elzinga ⋅ Jason Allan Holmberg ⋅ Andrew Whitworth ⋅ Eleanor Flatt ⋅ Laura Gustafson ⋅ Chaitanya Ryali ⋅ Yuan-Ting Hu ⋅ Baishan Guo ⋅ Andrew Westbury ⋅ Kate Saenko ⋅ Dídac Surís
Jun 6, 4:45 PM - 6:45 PM ExHall A & F
Automated video analysis is critical for wildlife conservation. A foundational task in this domain is multi-animal tracking (MAT), which underpins applications such as individual re-identification and behavior recognition. However, existing datasets are limited in scale, constrained to a few species, or lack sufficient temporal and geographical diversity, leaving no suitable benchmark for training general-purpose MAT models applicable across wild animal populations. To address this, we introduce SA-FARI, the largest open-source MAT dataset for wild animals. It comprises 11,609 camera trap videos collected over approximately 10 years (2014-2024) from 741 locations across 4 continents, spanning 99 species categories. Each video is exhaustively annotated, culminating in $\sim$46 hours of densely annotated footage containing 16,224 masklet identities and 942,702 individual bounding boxes, segmentation masks, and species labels. Alongside the task-specific annotations, we publish anonymized camera trap locations for each video. Finally, we present comprehensive benchmarks on SA-FARI using state-of-the-art vision-language models for detection and tracking, including SAM 3, evaluated with both species-specific and generic animal prompts. We also compare against vision-only methods developed specifically for wildlife analysis. SA-FARI is the first large-scale dataset to combine high species diversity, multi-region coverage, and high-quality spatio-temporal annotations, offering a new foundation for advancing generalizable multi-animal tracking in the wild. The dataset is available at [ANONYMIZED].
Poster

Thinking with Drafts: Speculative Temporal Reasoning for Efficient Long Video Understanding

Pengfei Hu ⋅ Meng Cao ⋅ Yingyao Wang ⋅ Yi Wang ⋅ Jiahua Dong ⋅ Jun Song ⋅ Cheng Yu ⋅ Bo Zheng ⋅ Xiaodan Liang
Jun 7, 3:30 PM - 5:30 PM ExHall A
Long video understanding is essential for human-like intelligence, enabling coherent perception and reasoning over extended temporal contexts. While the emerging thinking-with-frames paradigm—which alternates between global temporal reasoning and local frame examination—has advanced the reasoning capabilities of video multi-modal large language models (MLLMs), it suffers from a significant efficiency bottleneck due to the progressively growing and redundant multi-modal context. To address this, we propose SpecTemp, a reinforcement learning-based Speculative Temporal reasoning framework that decouples temporal perception from reasoning via a cooperative dual-model design. In SpecTemp, a lightweight draft MLLM rapidly explores and proposes salient frames from densely sampled temporal regions, while a powerful target MLLM focuses on temporal reasoning and verifies the draft’s proposals, iteratively refining its attention until convergence. This design mirrors the collaborative pathways of the human brain, balancing efficiency with accuracy. To support training, we construct the SpecTemp-80K dataset, featuring synchronized dual-level annotations for coarse evidence spans and fine-grained frame-level evidence. Experiments across multiple video understanding benchmarks demonstrate that SpecTemp not only maintains competitive accuracy but also significantly accelerates inference compared with existing thinking-with-frames methods.
Poster

Towards Photorealistic and Efficient Bokeh Rendering via Diffusion Framework

Linxiao Shi ⋅ Siming Zheng ⋅ Zerong Wang ⋅ Hao Zhang ⋅ Jinwei Chen ⋅ Bo Li ⋅ Shifeng Chen ⋅ Peng-Tao Jiang
Jun 5, 10:45 AM - 12:45 PM ExHall A-F
Existing mobile devices are constrained by compact optical designs, such as small apertures, which make it difficult to produce natural, optically realistic bokeh effects. Although recent learning-based methods have shown promising results, they still struggle with photos captured under high digital zoom levels, which often suffer from reduced resolution and loss of fine details. A naive solution is to enhance image quality before applying bokeh rendering, yet this two-stage pipeline reduces efficiency and introduces unnecessary error accumulation. To overcome these limitations, we propose MagicBokeh, a unified diffusion-based framework designed for high-quality and efficient bokeh rendering. Through an alternative training strategy and a focus-aware masked attention mechanism, our method jointly optimizes bokeh rendering and super-resolution, substantially improving both controllability and visual fidelity. Furthermore, we introduce a degradation-aware depth module to enable more accurate depth estimation from low-quality inputs. Experimental results demonstrate that MagicBokeh efficiently produces photorealistic bokeh effects, particularly on real-world low-resolution images, paving the way for future advancements in bokeh rendering. The code will be released publicly.
Poster

U^2Flow: Uncertainty-Aware Unsupervised Optical Flow Estimation

Xunpei Sun ⋅ Wenwei Lin ⋅ Yi Chang ⋅ Gang Chen
Jun 7, 11:45 AM - 1:45 PM ExHall F
Existing unsupervised optical flow methods typically lack reliable uncertainty estimation, limiting their robustness and interpretability. We propose U$^{2}$Flow, the first recurrent unsupervised framework that jointly estimates optical flow and per-pixel uncertainty. The core innovation is a decoupled learning strategy that derives uncertainty supervision from augmentation consistency via a Laplace-based maximum likelihood objective, enabling stable training without ground truth. The predicted uncertainty is further integrated into the network to guide adaptive flow refinement and dynamically modulate the regional smoothness loss. Furthermore, we introduce an uncertainty-guided bidirectional flow fusion mechanism that enhances robustness in challenging regions. Extensive experiments on KITTI and Sintel demonstrate that U$^{2}$Flow achieves state-of-the-art performance among unsupervised methods while producing highly reliable uncertainty maps, validating the effectiveness of our joint estimation paradigm.
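The abstract does not give the exact form of the Laplace-based objective, so the following is only a minimal sketch of the standard Laplace negative log-likelihood, $|r|/b + \log(2b)$, that such objectives are usually built from. Here `residuals` stands in for a per-pixel flow discrepancy (for example, between predictions on an image pair and on its augmented copy) and `log_scales` for the predicted log uncertainty $\log b$; both names, and the function `laplace_nll`, are assumptions rather than the authors' code.

```python
import math

def laplace_nll(residuals, log_scales):
    """Mean Laplace negative log-likelihood over pixels.

    residuals:  per-pixel flow errors r (hypothetical input).
    log_scales: per-pixel predicted log b, the log of the Laplace scale.
    """
    total = 0.0
    for r, log_b in zip(residuals, log_scales):
        b = math.exp(log_b)  # predicting log b keeps the scale positive
        # -log p(r | b) for a zero-mean Laplace density (1 / 2b) exp(-|r| / b)
        total += abs(r) / b + math.log(2.0 * b)
    return total / len(residuals)
```

The first term rewards a large scale where the residual is large, while the log term penalizes inflating the scale everywhere. For a fixed residual the loss is minimized at $b = |r|$, which is why the learned scale can be read as a per-pixel uncertainty.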
Poster

UnReflectAnything: RGB-Only Highlight Removal by Rendering Synthetic Specular Supervision

Alberto Rota ⋅ Mert Kiray ⋅ Mert Asim Karaoglu ⋅ Patrick Ruhkamp ⋅ Elena De Momi ⋅ Nassir Navab ⋅ Benjamin Busam
Jun 5, 10:45 AM - 12:45 PM ExHall A-F
Specular highlights distort appearance, obscure texture, and hinder geometric reasoning in both natural and surgical imagery. We present UnReflectAnything, an RGB-only framework that removes highlights from a single image by predicting a highlight map together with a reflection-free diffuse reconstruction. The model uses a frozen vision transformer encoder to extract multi-scale features, a lightweight head to localize specular regions, and a token-level inpainting module that restores corrupted feature patches before producing the final diffuse image. To overcome the lack of paired supervision, we introduce a Virtual Highlight Synthesis pipeline that renders physically plausible specularities using monocular geometry, Fresnel-aware shading, and randomized lighting, which enables training on arbitrary RGB images with correct geometric structure. UnReflectAnything generalizes across natural and surgical domains, where non-Lambertian surfaces and non-uniform lighting create severe highlights, and it achieves performance competitive with state-of-the-art methods on several benchmarks.
Poster

VGGT-Segmentor: Geometry-Enhanced Cross-View Segmentation

Yulu Gao ⋅ Bohao Zhang ⋅ Zongheng Tang ⋅ Jitong Liao ⋅ wenjun wu ⋅ Si Liu
Jun 6, 4:45 PM - 6:45 PM ExHall A & F
Instance-level object segmentation across disparate egocentric and exocentric views is a fundamental challenge in visual understanding, critical for applications in embodied AI and remote collaboration. This task is exceptionally difficult due to severe changes in scale, perspective, and occlusion, which destabilize direct pixel-level matching. While recent geometry-aware models like VGGT provide a strong foundation for feature alignment, we find they often fail at dense prediction tasks due to significant pixel-level projection drift, even when their internal object-level attention remains consistent. To bridge this gap, we introduce VGGT-Segmentor (VGGT-S), a framework that unifies robust geometric modeling with pixel-accurate semantic segmentation. VGGT-S leverages VGGT's powerful cross-view feature representation and introduces a novel Union Segmentation Head. This head operates in three stages: mask prompt fusion, coarse point-guided prediction, and iterative mask refinement, effectively translating high-level feature alignment into a precise segmentation mask. Furthermore, we propose a single-image self-supervised training strategy that eliminates the need for paired annotations and enables strong zero-shot generalization. On the challenging Ego–Exo4D benchmark, VGGT-S sets a new state of the art, achieving 67.7% and 68.0% average IoU for Ego→Exo and Exo→Ego tasks, respectively, significantly outperforming prior methods. Notably, our zero-shot model surpasses most fully-supervised baselines, demonstrating the effectiveness and scalability of our approach.
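The IoU figures above measure the overlap between a predicted and a ground-truth mask. As a minimal sketch of that metric for a single pair of binary masks (the benchmark's own averaging protocol is not described in the abstract, and the helper name `mask_iou` and the empty-mask convention are assumptions):

```python
def mask_iou(pred, target):
    """Intersection over union of two binary masks given as lists of rows."""
    inter = 0
    union = 0
    for pred_row, target_row in zip(pred, target):
        for p, t in zip(pred_row, target_row):
            inter += 1 if (p and t) else 0  # pixel set in both masks
            union += 1 if (p or t) else 0   # pixel set in at least one mask
    if union == 0:
        # Both masks empty: treated here as perfect agreement (a convention).
        return 1.0
    return inter / union
```

An average IoU such as the 67.7% reported for Ego→Exo would then be a mean of this quantity over evaluated mask pairs.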
Poster

VGGT-Ω

Jianyuan Wang ⋅ Minghao Chen ⋅ Shangzhan Zhang ⋅ Nikita Karaev ⋅ Johannes Schönberger ⋅ Patrick Labatut ⋅ Piotr Bojanowski ⋅ David Novotny ⋅ Andrea Vedaldi ⋅ Christian Rupprecht
Jun 6, 4:45 PM - 6:45 PM ExHall A & F
We present VGGT-Ω, a feed-forward model for 3D reconstruction that substantially advances the state of the art in accuracy, efficiency, and capability for both static and dynamic scenes. Prior models such as VGGT have shown that feed-forward 3D reconstruction can already be competitive with traditional optimization-based methods. Here, we further demonstrate that the accuracy and robustness of these models scale predictably with model capacity and data size. To enable training 3D reconstruction models at an unprecedented scale, we introduce a high-quality data annotation pipeline that handles dynamic scenes, a self-supervised learning protocol, and architectural changes that greatly reduce memory requirements. We significantly simplify VGGT’s architecture by replacing multiple dense prediction heads with loss-driven multitask learning, removing unstable DPT blocks, and introducing more efficient global attention via scene tokens. These changes allow us to efficiently train VGGT-Ω with 20$\times$ more supervised data and 100$\times$ more unsupervised data than prior work, while requiring only 30% of VGGT’s memory and running 1.6$\times$ faster at inference. As a result, VGGT-Ω establishes a new state of the art for 3D reconstruction on both static and dynamic scenes across a wide range of benchmarks, e.g., improving the camera estimation accuracy by 77% on the Sintel dataset. Models and code will be publicly released.
Poster

ViT^3: Unlocking Test-Time Training in Vision

Dongchen Han ⋅ Yining Li ⋅ Tianyu Li ⋅ Zixuan Cao ⋅ Ziming Wang ⋅ Jun Song ⋅ Cheng Yu ⋅ Bo Zheng ⋅ Gao Huang
Jun 5, 10:45 AM - 12:45 PM ExHall A-F
Test-Time Training (TTT) has recently emerged as a promising direction for efficient sequence modeling. TTT reformulates the attention operation as an online learning problem, constructing a compact inner model from key–value pairs at test time. This reformulation opens a rich and flexible design space while achieving linear computational complexity. However, crafting a powerful visual TTT design remains challenging: fundamental choices for the inner module and inner training lack comprehensive understanding and practical guidelines. To bridge this critical gap, we present a systematic empirical study of TTT designs for visual sequence modeling. From a series of experiments and analyses, we distill six practical insights that establish design principles for effective visual TTT and illuminate paths for future improvement. These findings culminate in the Vision Test-Time Training (ViT$^3$) model, a pure TTT architecture that achieves linear complexity and parallelizable computation. We evaluate ViT$^3$ across diverse visual tasks, including image classification, image generation, object detection, and semantic segmentation. Results show that ViT$^3$ consistently matches or outperforms advanced linear-complexity models (e.g., Mamba and linear attention variants) and effectively narrows the gap to highly optimized vision Transformers. We hope this study and the ViT$^3$ baseline can facilitate future work on visual TTT models. Code will be released.