Timezone: America/Chicago
FRI 13 JUN
9 a.m.
Orals 9:00-10:15
[9:00]
Motion Prompting: Controlling Video Generation with Motion Trajectories
[9:15]
Go-with-the-Flow: Motion-Controllable Video Diffusion Models Using Real-Time Warped Noise
[9:30]
LookingGlass: Generative Anamorphoses via Laplacian Pyramid Warping
[9:45]
Alias-Free Latent Diffusion Models: Improving Fractional Shift Equivariance of Diffusion Latent Space
[10:00]
RandAR: Decoder-only Autoregressive Visual Generation in Random Orders
(ends 10:15 AM)
Orals 9:00-10:15
[9:00]
OpenING: A Comprehensive Benchmark for Judging Open-ended Interleaved Image-Text Generation
[9:15]
LibraGrad: Balancing Gradient Flow for Universally Better Vision Transformer Attributions
[9:30]
Do We Always Need the Simplicity Bias? Looking for Optimal Inductive Biases in the Wild
[9:45]
Molmo and PixMo: Open Weights and Open Data for State-of-the-Art Vision-Language Models
[10:00]
Rethinking Vision-Language Model in Face Forensics: Multi-Modal Interpretable Forged Face Detector
(ends 10:15 AM)
Orals 9:00-10:15
[9:00]
CleanDIFT: Diffusion Features without Noise
[9:15]
OverLoCK: An Overview-first-Look-Closely-next ConvNet with Context-Mixing Dynamic Kernels
[9:30]
Towards Explicit Geometry-Reflectance Collaboration for Generalized LiDAR Segmentation in Adverse Weather
[9:45]
DiffFNO: Diffusion Fourier Neural Operator
[10:00]
Removing Reflections from RAW Photos
(ends 10:15 AM)
10:30 a.m.
Posters 10:30-12:30
FreeUV: Ground-Truth-Free Realistic Facial UV Texture Recovery via Cross-Assembly Inference Strategy
Rethinking Temporal Fusion with a Unified Gradient Descent View for 3D Semantic Occupancy Prediction
VideoICL: Confidence-based Iterative In-context Learning for Out-of-Distribution Video Understanding
(ends 12:30 PM)
1 p.m.
Orals 1:00-2:30
[1:00]
FoundationStereo: Zero-Shot Stereo Matching
[1:15]
MoGe: Unlocking Accurate Monocular Geometry Estimation for Open-Domain Images with Optimal Training Supervision
[1:30]
Multi-view Reconstruction via SfM-guided Monocular Depth Estimation
[1:45]
MV-DUSt3R+: Single-Stage Scene Reconstruction from Sparse Views In 2 Seconds
[2:00]
VGGT: Visual Geometry Grounded Transformer
[2:15]
CraftsMan3D: High-fidelity Mesh Generation with 3D Native Diffusion and Interactive Geometry Refiner
(ends 2:30 PM)
Orals 1:00-2:30
[1:00]
CAP4D: Creating Animatable 4D Portrait Avatars with Morphable Multi-View Diffusion Models
[1:15]
Reanimating Images using Neural Representations of Dynamic Stimuli
[1:30]
EgoLM: Multi-Modal Language Model of Egocentric Motions
[1:45]
Reconstructing Humans with a Biomechanically Accurate Skeleton
[2:00]
MEGA: Masked Generative Autoencoder for Human Mesh Recovery
[2:15]
TokenHSI: Unified Synthesis of Physical Human-Scene Interactions through Task Tokenization
(ends 2:30 PM)
Orals 1:00-2:30
[1:00]
Descriptor-In-Pixel: Point-Feature Tracking For Pixel Processor Arrays
[1:15]
Temporally Consistent Object-Centric Learning by Contrasting Slots
[1:30]
Temporal Alignment-Free Video Matching for Few-shot Action Recognition
[1:45]
Closed-Loop Supervised Fine-Tuning of Tokenized Traffic Models
[2:00]
The PanAf-FGBG Dataset: Understanding the Impact of Backgrounds in Wildlife Behaviour Recognition
[2:15]
Rethinking Spiking Self-Attention Mechanism: Implementing α-XNOR Similarity Calculation in Spiking Transformers
(ends 2:30 PM)
2:45 p.m.
4 p.m.
Posters 4:00-6:00
HiFi-Portrait: Zero-shot Identity-preserved Portrait Generation with High-fidelity Multi-face Fusion
CraftsMan3D: High-fidelity Mesh Generation with 3D Native Diffusion and Interactive Geometry Refiner
Common3D: Self-Supervised Learning of 3D Morphable Models for Common Objects in Neural Feature Space
Tra-MoE: Learning Trajectory Prediction Model from Multiple Domains for Adaptive Policy Conditioning
K-Sort Arena: Efficient and Reliable Benchmarking for Generative Models via K-wise Human Preferences
COUNTS: Benchmarking Object Detectors and Multimodal Large Language Models under Distribution Shifts
Augmenting Multimodal LLMs with Self-Reflective Tokens for Knowledge-based Visual Question Answering
SemiDAViL: Semi-supervised Domain Adaptation with Vision-Language Guidance for Semantic Segmentation
(ends 6:00 PM)
SAT 14 JUN
9 a.m.
Orals 9:00-10:15
[9:00]
MegaSaM: Accurate, Fast and Robust Structure and Motion from Casual Dynamic Videos
[9:15]
Stereo4D: Learning How Things Move in 3D from Internet Stereo Videos
[9:30]
Continuous 3D Perception Model with Persistent State
[9:45]
TacoDepth: Towards Efficient Radar-Camera Depth Estimation with One-stage Fusion
[10:00]
Neural Inverse Rendering from Propagating Light
(ends 10:15 AM)
Orals 9:00-10:15
[9:00]
SegEarth-OV: Towards Training-Free Open-Vocabulary Segmentation for Remote Sensing Images
[9:15]
Towards Universal Dataset Distillation via Task-Driven Diffusion
[9:30]
IceDiff: High Resolution and High-Quality Arctic Sea Ice Forecasting with Generative Diffusion Prior
[9:45]
Efficient Test-time Adaptive Object Detection via Sensitivity-Guided Pruning
[10:00]
Keep the Balance: A Parameter-Efficient Symmetrical Framework for RGB+X Semantic Segmentation
(ends 10:15 AM)
Orals 9:00-10:15
[9:00]
Identifying and Mitigating Position Bias of Multi-image Vision-Language Models
[9:15]
Mitigating Hallucinations in Large Vision-Language Models via DPO: On-Policy Data Hold the Key
[9:30]
Q-Eval-100K: Evaluating Visual Quality and Alignment Level for Text-to-Vision Content
[9:45]
Thinking in Space: How Multimodal Large Language Models See, Remember, and Recall Spaces
[10:00]
From Multimodal LLMs to Generalist Embodied Agents: Methods and Lessons
(ends 10:15 AM)
10:30 a.m.
Posters 10:30-12:30
HyperPose: Hypernetwork-Infused Camera Pose Localization and an Extended Cambridge Landmarks Dataset
IceDiff: High Resolution and High-Quality Arctic Sea Ice Forecasting with Generative Diffusion Prior
BlockDance: Reuse Structurally Similar Spatio-Temporal Features to Accelerate Diffusion Transformers
InstanceGaussian: Appearance-Semantic Joint Gaussian Representation for 3D Instance-Level Perception
(ends 12:30 PM)
1 p.m.
Orals 1:00-2:15
[1:00]
Reconstruction vs. Generation: Taming Optimization Dilemma in Latent Diffusion Models
[1:15]
Language-Guided Image Tokenization for Generation
[1:30]
DreamRelation: Bridging Customization and Relation Generation
[1:45]
Infinity: Scaling Bitwise AutoRegressive Modeling for High-Resolution Image Synthesis
[2:00]
Autoregressive Distillation of Diffusion Transformers
(ends 2:15 PM)
Orals 1:00-2:15
[1:00]
PDFactor: Learning Tri-Perspective View Policy Diffusion Field for Multi-Task Robotic Manipulation
[1:15]
RoboSpatial: Teaching Spatial Understanding to 2D and 3D Vision-Language Models for Robotics
[1:30]
GROVE: A Generalized Reward for Learning Open-Vocabulary Physical Skill
[1:45]
Navigation World Models
[2:00]
Viewpoint Rosetta Stone: Unlocking Unpaired Ego-Exo Videos for View-invariant Representation Learning
(ends 2:15 PM)
Orals 1:00-2:15
[1:00]
DORNet: A Degradation Oriented and Regularized Network for Blind Depth Super-Resolution
[1:15]
Convex Relaxation for Robust Vanishing Point Estimation in Manhattan World
[1:30]
Learned Binocular-Encoding Optics for RGBD Imaging Using Joint Stereo and Focus Cues
[1:45]
Camera Resection from Known Line Pencils and a Radially Distorted Scanline
[2:00]
Opportunistic Single-Photon Time of Flight
(ends 2:15 PM)
2:30 p.m.
5 p.m.
Posters 5:00-7:00
AuraFusion360: Augmented Unseen Region Alignment for Reference-based 360° Unbounded Scene Inpainting
GPVK-VL: Geometry-Preserving Virtual Keyframes for Visual Localization under Large Viewpoint Changes
GREAT: Geometry-Intention Collaborative Inference for Open-Vocabulary 3D Object Affordance Grounding
EnergyMoGen: Compositional Human Motion Generation with Energy-Based Diffusion Model in Latent Space
CAD-Llama: Leveraging Large Language Models for Computer-Aided Design Parametric 3D Model Generation
Do Computer Vision Foundation Models Learn the Low-level Characteristics of the Human Visual System?
(ends 7:00 PM)
SUN 15 JUN
9 a.m.
Orals 9:00-10:15
[9:00]
Improving Diffusion Inverse Problem Solving with Decoupled Noise Annealing
[9:15]
DesignDiffusion: High-Quality Text-to-Design Image Generation with Diffusion Models
[9:30]
CustAny: Customizing Anything from A Single Example
[9:45]
Minority-Focused Text-to-Image Generation via Prompt Optimization
[10:00]
Black-Box Forgery Attacks on Semantic Watermarks for Diffusion Models
(ends 10:15 AM)
Orals 9:00-10:15
[9:00]
UniAP: Unifying Inter- and Intra-Layer Automatic Parallelism by Mixed Integer Quadratic Programming
[9:15]
Geometric Knowledge-Guided Localized Global Distribution Alignment for Federated Learning
[9:30]
Enhancing Diversity for Data-free Quantization
[9:45]
TopoCellGen: Generating Histopathology Cell Topology with a Diffusion Model
[10:00]
Enhancing SAM with Efficient Prompting and Preference Optimization for Semi-supervised Medical Image Segmentation
(ends 10:15 AM)
Orals 9:00-10:15
[9:00]
Adv-CPG: A Customized Portrait Generation Framework with Facial Adversarial Attacks
[9:15]
Gromov–Wasserstein Problem with Cyclic Symmetry
[9:30]
Time of the Flight of the Gaussians: Optimizing Depth Indirectly in Dynamic Radiance Fields
[9:45]
Zero-Shot Monocular Scene Flow Estimation in the Wild
[10:00]
3D Student Splatting and Scooping
(ends 10:15 AM)
10:30 a.m.
Posters 10:30-12:30
GeoAvatar: Geometrically-Consistent Multi-Person Avatar Reconstruction from Sparse Multi-View Videos
Edit Away and My Face Will not Stay: Personal Biometric Defense against Malicious Generative Editing
(ends 12:30 PM)
1 p.m.
Orals 1:00-2:15
[1:00]
DIFIX3D+: Improving 3D Reconstructions with Single-Step Diffusion Models
[1:15]
3DGUT: Enabling Distorted Cameras and Secondary Rays in Gaussian Splatting
[1:30]
DNF: Unconditional 4D Generation with Dictionary-based Neural Fields
[1:45]
CAT4D: Create Anything in 4D with Multi-View Video Diffusion Models
[2:00]
Diffusion Renderer: Neural Inverse and Forward Rendering with Video Diffusion Models
(ends 2:15 PM)
Orals 1:00-2:30
[1:00]
Effective SAM Combination for Open-Vocabulary Semantic Segmentation
[1:15]
FluidNexus: 3D Fluid Reconstruction and Prediction from a Single Video
[1:30]
Birth and Death of a Rose
[1:45]
Semi-Supervised State-Space Model with Dynamic Stacking Filter for Real-World Video Deraining
[2:00]
AnyEdit: Mastering Unified High-Quality Image Editing for Any Idea
[2:15]
Generative Multimodal Pretraining with Discrete Diffusion Timestep Tokens
(ends 2:30 PM)
Orals 1:00-2:30
[1:00]
Seeing Far and Clearly: Mitigating Hallucinations in MLLMs with Attention Causal Decoding
[1:15]
Video-XL: Extra-Long Vision Language Model for Hour-Scale Video Understanding
[1:30]
LoRASculpt: Sculpting LoRA for Harmonizing General and Specialized Knowledge in Multimodal Large Language Models
[1:45]
VideoEspresso: A Large-Scale Chain-of-Thought Dataset for Fine-Grained Video Reasoning via Core Frame Selection
[2:00]
SEAL: Semantic Attention Learning for Long Video Representation
[2:15]
Learning Audio-guided Video Representation with Gated Attention for Video-Text Retrieval
(ends 2:30 PM)
2:45 p.m.
4 p.m.
Posters 4:00-6:00
Neural LightRig: Unlocking Accurate Object Normal and Material Estimation with Multi-Light Diffusion
IDEA: Inverted Text with Cooperative Deformable Aggregation for Multi-modal Object Re-Identification
(ends 6:00 PM)