Oral Session

Oral Session 1C: Image Processing and Deep Architectures

Fri 13 Jun 7 a.m. PDT — 8:15 a.m. PDT

Fri 13 June 7:00 - 7:15 PDT

CleanDIFT: Diffusion Features without Noise

Nick Stracke · Stefan Andreas Baumann · Kolja Bauer · Frank Fundel · Björn Ommer

Internal features from large-scale pre-trained diffusion models have recently been established as powerful semantic descriptors for a wide range of downstream tasks. Works that use these features generally need to add noise to images before passing them through the model to obtain the semantic features, as the models do not offer the most useful features when given images with little to no noise. We show that this noise has a critical impact on the usefulness of these features that cannot be remedied by ensembling with different random noises. We address this issue by introducing a lightweight, unsupervised fine-tuning method that enables diffusion backbones to provide high-quality, noise-free semantic features. We show that these features readily outperform previous diffusion features by a wide margin in a wide variety of extraction setups and downstream tasks, offering better performance than even ensemble-based methods at a fraction of the cost.
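To make the contrast concrete, here is a minimal sketch (not the authors' code) of the two feature-extraction recipes the abstract discusses: the conventional "add noise at timestep t, then extract" approach versus a clean pass at t = 0, as enabled by a fine-tuned backbone. `FeatureBackbone` is a hypothetical stand-in for a pre-trained diffusion U-Net; the noising follows the standard DDPM forward process x_t = sqrt(a_bar_t) * x_0 + sqrt(1 - a_bar_t) * eps.

```python
# Sketch: noisy vs. clean diffusion feature extraction (illustrative only).
import torch
import torch.nn as nn

class FeatureBackbone(nn.Module):
    """Placeholder for a diffusion U-Net that returns intermediate features."""
    def __init__(self, channels: int = 64):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(3, channels, 3, padding=1), nn.SiLU(),
            nn.Conv2d(channels, channels, 3, padding=1),
        )

    def forward(self, x, t):
        # A real model conditions on a timestep embedding; omitted here.
        return self.body(x)

def noisy_features(model, x0, t, alphas_cumprod):
    """Conventional recipe: diffuse the image to timestep t, then extract."""
    a_bar = alphas_cumprod[t].view(-1, 1, 1, 1)
    eps = torch.randn_like(x0)
    x_t = a_bar.sqrt() * x0 + (1.0 - a_bar).sqrt() * eps
    return model(x_t, t)

def clean_features(model, x0):
    """CleanDIFT-style usage after fine-tuning: no noise added (t = 0)."""
    t = torch.zeros(x0.shape[0], dtype=torch.long)
    return model(x0, t)

if __name__ == "__main__":
    torch.manual_seed(0)
    model = FeatureBackbone()
    x0 = torch.rand(1, 3, 64, 64)                       # toy image in [0, 1]
    alphas_cumprod = torch.linspace(0.999, 0.01, 1000)  # toy noise schedule
    f_noisy = noisy_features(model, x0, torch.tensor([261]), alphas_cumprod)
    f_clean = clean_features(model, x0)
    print(f_noisy.shape, f_clean.shape)
```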

Fri 13 June 7:15 - 7:30 PDT

OverLoCK: An Overview-first-Look-Closely-next ConvNet with Context-Mixing Dynamic Kernels

In the human vision system, top-down attention plays a crucial role in perception: the brain first performs an overall but rough scene analysis to extract salient cues (i.e., overview first), followed by a finer-grained examination to make more accurate judgments (i.e., look closely next). However, recent efforts in ConvNet design have primarily focused on increasing kernel size to obtain a larger receptive field, without considering this crucial biomimetic mechanism to further improve performance. To this end, we propose a novel pure ConvNet vision backbone, termed OverLoCK, which is carefully devised from both the architecture and mixer perspectives. Specifically, we introduce a biomimetic Deep-stage Decomposition Strategy (DDS) that fuses semantically meaningful context representations into middle and deep layers by providing dynamic top-down context guidance at both the feature and kernel-weight levels. To fully unleash the power of top-down context guidance, we further propose a novel Context-Mixing Dynamic Convolution (ContMix) that effectively models long-range dependencies while preserving inherent local inductive biases even when the input resolution increases, properties that are absent in previous convolutions. With the support of both DDS and ContMix, our OverLoCK exhibits notable performance improvements over existing methods. For instance, OverLoCK-T achieves a Top-1 accuracy of 84.2%, significantly surpassing ConvNeXt-B while using only around one-third of the FLOPs/parameters. On object detection with Cascade Mask R-CNN, our OverLoCK-S surpasses MogaNet-B by a significant 1% in box AP. On semantic segmentation with UperNet, our OverLoCK-T improves over UniRepLKNet-T by a remarkable 1.7% in mIoU.
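Below is a minimal sketch, under stated assumptions, of the general idea of a context-guided dynamic convolution: per-sample depthwise kernels are predicted from a pooled "overview" feature and applied to the fine-grained feature map. This is not the paper's ContMix; the class name, the pooling choice, and the per-channel kernel generation are all illustrative simplifications.

```python
# Sketch: dynamic depthwise convolution whose kernels are generated from
# a coarse context feature (loosely mirroring top-down guidance).
import torch
import torch.nn as nn
import torch.nn.functional as F

class ContextGuidedDynamicConv(nn.Module):
    def __init__(self, channels: int, kernel_size: int = 3):
        super().__init__()
        self.channels = channels
        self.kernel_size = kernel_size
        # Predicts one depthwise kernel per channel from the pooled context.
        self.kernel_gen = nn.Linear(channels, channels * kernel_size * kernel_size)

    def forward(self, x: torch.Tensor, context: torch.Tensor) -> torch.Tensor:
        b, c, h, w = x.shape
        k = self.kernel_size
        ctx = F.adaptive_avg_pool2d(context, 1).flatten(1)      # (b, c) "overview"
        weight = self.kernel_gen(ctx).view(b * c, 1, k, k)      # per-sample kernels
        # Grouped conv applies a distinct kernel to every (sample, channel) pair.
        x = x.reshape(1, b * c, h, w)
        out = F.conv2d(x, weight, padding=k // 2, groups=b * c)
        return out.view(b, c, h, w)

if __name__ == "__main__":
    layer = ContextGuidedDynamicConv(channels=32)
    feat = torch.randn(2, 32, 56, 56)   # fine-grained ("look closely") features
    ctx = torch.randn(2, 32, 14, 14)    # coarse ("overview first") context
    print(layer(feat, ctx).shape)       # torch.Size([2, 32, 56, 56])
```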

Fri 13 June 7:30 - 7:45 PDT

Towards Explicit Geometry-Reflectance Collaboration for Generalized LiDAR Segmentation in Adverse Weather

Longyu Yang · Ping Hu · Shangbo Yuan · Lu Zhang · Jun Liu · Heng Tao Shen · Xiaofeng Zhu

Existing LiDAR semantic segmentation models often suffer from decreased accuracy when exposed to adverse weather conditions. Recent methods addressing this issue focus on enhancing training data through weather simulation or universal augmentation techniques. However, few works have studied the negative impacts caused by the heterogeneous domain shifts in the geometric structure and reflectance intensity of point clouds. In this paper, we delve into this challenge and address it with a novel Geometry-Reflectance Collaboration (GRC) framework that explicitly separates feature extraction for geometry and reflectance. Specifically, GRC employs a dual-branch architecture that initially processes geometric and reflectance features independently, thereby capitalizing on their distinct characteristics. GRC then adopts a robust multi-level feature collaboration module to suppress redundant and unreliable information from both branches. Consequently, without complex simulation or augmentation, our method effectively extracts intrinsic information about the scene while suppressing interference, thus achieving better robustness and generalization in adverse weather conditions. We demonstrate the effectiveness of GRC through comprehensive experiments on challenging benchmarks, showing that our method outperforms previous approaches and establishes new state-of-the-art results.
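A minimal sketch of the dual-branch idea, with assumed names and a deliberately simplified per-point MLP in place of the paper's actual segmentation backbone: geometry (xyz) and reflectance (intensity) are encoded separately and then fused by a learned gate standing in for the multi-level collaboration module.

```python
# Sketch: separate geometry and reflectance branches with gated fusion.
import torch
import torch.nn as nn

class DualBranchSegHead(nn.Module):
    def __init__(self, num_classes: int, width: int = 64):
        super().__init__()
        self.geometry_branch = nn.Sequential(
            nn.Linear(3, width), nn.ReLU(), nn.Linear(width, width))
        self.reflectance_branch = nn.Sequential(
            nn.Linear(1, width), nn.ReLU(), nn.Linear(width, width))
        # Learned gate: down-weights whichever branch is unreliable per point.
        self.gate = nn.Sequential(nn.Linear(2 * width, width), nn.Sigmoid())
        self.classifier = nn.Linear(width, num_classes)

    def forward(self, points: torch.Tensor) -> torch.Tensor:
        # points: (N, 4) = x, y, z, reflectance intensity
        geo = self.geometry_branch(points[:, :3])
        refl = self.reflectance_branch(points[:, 3:4])
        g = self.gate(torch.cat([geo, refl], dim=-1))
        fused = g * geo + (1.0 - g) * refl
        return self.classifier(fused)            # per-point class logits

if __name__ == "__main__":
    head = DualBranchSegHead(num_classes=19)
    cloud = torch.randn(1024, 4)                 # toy LiDAR scan
    print(head(cloud).shape)                     # torch.Size([1024, 19])
```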

Fri 13 June 7:45 - 8:00 PDT

DiffFNO: Diffusion Fourier Neural Operator

Xiaoyi Liu · Hao Tang

We introduce DiffFNO, a novel framework for arbitrary-scale super-resolution that incorporates a Weighted Fourier Neural Operator (WFNO) enhanced by a diffusion process. DiffFNO's adaptive mode weighting mechanism in the Fourier domain effectively captures critical frequency components, significantly improving the reconstruction of high-frequency image details that are essential for super-resolution tasks. Additionally, we propose a Gated Fusion Mechanism to efficiently integrate features from the WFNO and an attention-based neural operator, enhancing the network's capability to capture both global and local image details. To further improve efficiency, DiffFNO employs a deterministic ODE sampling strategy called the Adaptive Time-step ODE Solver (AT-ODE), which accelerates inference by dynamically adjusting step sizes while preserving output quality. Extensive experiments demonstrate that DiffFNO achieves state-of-the-art results, outperforming existing methods across various scaling factors, including those beyond the training distribution, by a margin of 2–4 dB in PSNR. Our approach sets a new standard in super-resolution, delivering both superior accuracy and computational efficiency.
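To illustrate the weighted-spectral-convolution idea, here is a minimal sketch, not DiffFNO itself: a simplified 2-D Fourier layer in which the retained low-frequency modes are re-weighted by learnable scalars before channel mixing. The class name, the single-sided mode truncation, and the weighting scheme are assumptions made for brevity.

```python
# Sketch: FNO-style spectral convolution with learnable per-mode weights.
import torch
import torch.nn as nn

class WeightedSpectralConv2d(nn.Module):
    def __init__(self, channels: int, modes: int = 16):
        super().__init__()
        self.modes = modes
        scale = 1.0 / channels
        # Complex channel-mixing weights for the retained Fourier modes.
        self.mix = nn.Parameter(
            scale * torch.randn(channels, channels, modes, modes, dtype=torch.cfloat))
        # Adaptive mode weighting: one learnable scalar per retained mode.
        self.mode_weight = nn.Parameter(torch.ones(modes, modes))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, c, h, w = x.shape
        m = self.modes
        x_ft = torch.fft.rfft2(x)                      # (b, c, h, w//2 + 1)
        out_ft = torch.zeros_like(x_ft)
        weighted = x_ft[:, :, :m, :m] * self.mode_weight
        out_ft[:, :, :m, :m] = torch.einsum("bixy,ioxy->boxy", weighted, self.mix)
        return torch.fft.irfft2(out_ft, s=(h, w))

if __name__ == "__main__":
    layer = WeightedSpectralConv2d(channels=8, modes=12)
    feat = torch.randn(1, 8, 48, 48)
    print(layer(feat).shape)                           # torch.Size([1, 8, 48, 48])
```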

Fri 13 June 8:00 - 8:15 PDT

Removing Reflections from RAW Photos

Eric Kee · Adam Pikielny · Kevin Blackburn-Matzen · Marc Levoy

We describe a system to remove real-world reflections from images for consumer photography. Our system operates on linear (RAW) photos, and accepts an optional contextual photo looking in the opposite direction (e.g., the "selfie" camera on a mobile device). This optional photo helps disambiguate what should be considered the reflection. The system is trained solely on synthetic mixtures of real-world RAW images, which we combine using a reflection simulation that is photometrically and geometrically accurate. Our system comprises a base model that accepts the captured photo and optional context photo as input, and runs at 256p, followed by an up-sampling model that transforms 256p images to full resolution. The system can produce images for review at 1K in 4.5 to 6.5 seconds on a MacBook or iPhone 14 Pro. We test on RAW photos that were captured in the field and embody typical consumer photos, and show that our RAW-image simulation yields SOTA performance.
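As a rough illustration of the training-data idea, here is a minimal sketch, with assumptions throughout, of synthesizing a reflection mixture in linear space: the captured photo is modeled as the transmitted scene plus an attenuated, blurred reflection, giving an (input, ground-truth) pair from two real images. The function name, the box-blur stand-in, and the fixed attenuation are hypothetical simplifications, not the paper's photometrically and geometrically accurate simulator.

```python
# Sketch: additive reflection mixture in linear (RAW-like) space.
import torch
import torch.nn.functional as F

def synthesize_reflection_mixture(transmitted: torch.Tensor,
                                  reflected: torch.Tensor,
                                  attenuation: float = 0.3,
                                  blur_kernel: int = 9):
    """Both inputs are linear images of shape (1, C, H, W) with values in [0, 1]."""
    c = reflected.shape[1]
    # Defocus-style blur on the reflected layer (a crude stand-in).
    k = torch.ones(c, 1, blur_kernel, blur_kernel) / (blur_kernel ** 2)
    reflected_blurred = F.conv2d(reflected, k, padding=blur_kernel // 2, groups=c)
    # Addition is physically meaningful only because the images are linear.
    mixture = transmitted + attenuation * reflected_blurred
    return mixture.clamp(0.0, 1.0), transmitted        # (network input, target)

if __name__ == "__main__":
    scene = torch.rand(1, 3, 256, 256)       # stand-in for the transmitted RAW photo
    reflection = torch.rand(1, 3, 256, 256)  # stand-in for the reflected scene
    x, y = synthesize_reflection_mixture(scene, reflection)
    print(x.shape, y.shape)
```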