Timezone: America/Chicago

FRI 13 JUN
9 a.m.
Orals 9:00-10:15
[9:00] Motion Prompting: Controlling Video Generation with Motion Trajectories
[9:15] Go-with-the-Flow: Motion-Controllable Video Diffusion Models Using Real-Time Warped Noise
[9:30] LookingGlass: Generative Anamorphoses via Laplacian Pyramid Warping
[9:45] Alias-Free Latent Diffusion Models: Improving Fractional Shift Equivariance of Diffusion Latent Space
[10:00] RandAR: Decoder-only Autoregressive Visual Generation in Random Orders
(ends 10:15 AM)
Orals 9:00-10:15
[9:00] OpenING: A Comprehensive Benchmark for Judging Open-ended Interleaved Image-Text Generation
[9:15] LibraGrad: Balancing Gradient Flow for Universally Better Vision Transformer Attributions
[9:30] Do We Always Need the Simplicity Bias? Looking for Optimal Inductive Biases in the Wild
[9:45] Molmo and PixMo: Open Weights and Open Data for State-of-the-Art Vision-Language Models
[10:00] Rethinking Vision-Language Model in Face Forensics: Multi-Modal Interpretable Forged Face Detector
(ends 10:15 AM)
Orals 9:00-10:15
[9:00] CleanDIFT: Diffusion Features without Noise
[9:15] OverLoCK: An Overview-first-Look-Closely-next ConvNet with Context-Mixing Dynamic Kernels
[9:30] Towards Explicit Geometry-Reflectance Collaboration for Generalized LiDAR Segmentation in Adverse Weather
[9:45] DiffFNO: Diffusion Fourier Neural Operator
[10:00] Removing Reflections from RAW Photos
(ends 10:15 AM)
10:30 a.m.
Posters 10:30-12:30
(ends 12:30 PM)
1 p.m.
Orals 1:00-2:30
[1:00] FoundationStereo: Zero-Shot Stereo Matching
[1:15] MoGe: Unlocking Accurate Monocular Geometry Estimation for Open-Domain Images with Optimal Training Supervision
[1:30] Multi-view Reconstruction via SfM-guided Monocular Depth Estimation
[1:45] MV-DUSt3R+: Single-Stage Scene Reconstruction from Sparse Views In 2 Seconds
[2:00] VGGT: Visual Geometry Grounded Transformer
[2:15] CraftsMan3D: High-fidelity Mesh Generation with 3D Native Diffusion and Interactive Geometry Refiner
(ends 2:30 PM)
Orals 1:00-2:30
[1:00] CAP4D: Creating Animatable 4D Portrait Avatars with Morphable Multi-View Diffusion Models
[1:15] Reanimating Images using Neural Representations of Dynamic Stimuli
[1:30] EgoLM: Multi-Modal Language Model of Egocentric Motions
[1:45] Reconstructing Humans with a Biomechanically Accurate Skeleton
[2:00] MEGA: Masked Generative Autoencoder for Human Mesh Recovery
[2:15] TokenHSI: Unified Synthesis of Physical Human-Scene Interactions through Task Tokenization
(ends 2:30 PM)
Orals 1:00-2:30
[1:00] Descriptor-In-Pixel: Point-Feature Tracking For Pixel Processor Arrays
[1:15] Temporally Consistent Object-Centric Learning by Contrasting Slots
[1:30] Temporal Alignment-Free Video Matching for Few-shot Action Recognition
[1:45] Closed-Loop Supervised Fine-Tuning of Tokenized Traffic Models
[2:00] The PanAf-FGBG Dataset: Understanding the Impact of Backgrounds in Wildlife Behaviour Recognition
[2:15] Rethinking Spiking Self-Attention Mechanism: Implementing α-XNOR Similarity Calculation in Spiking Transformers
(ends 2:30 PM)
2:45 p.m.
Keynote: Harry Shum
(ends 3:45 PM)
4 p.m.
Posters 4:00-6:00
(ends 6:00 PM)

SAT 14 JUN
9 a.m.
Orals 9:00-10:15
[9:00] MegaSaM: Accurate, Fast and Robust Structure and Motion from Casual Dynamic Videos
[9:15] Stereo4D: Learning How Things Move in 3D from Internet Stereo Videos
[9:30] Continuous 3D Perception Model with Persistent State
[9:45] TacoDepth: Towards Efficient Radar-Camera Depth Estimation with One-stage Fusion
[10:00] Neural Inverse Rendering from Propagating Light
(ends 10:15 AM)
Orals 9:00-10:15
[9:00] SegEarth-OV: Towards Training-Free Open-Vocabulary Segmentation for Remote Sensing Images
[9:15] Towards Universal Dataset Distillation via Task-Driven Diffusion
[9:30] IceDiff: High Resolution and High-Quality Arctic Sea Ice Forecasting with Generative Diffusion Prior
[9:45] Efficient Test-time Adaptive Object Detection via Sensitivity-Guided Pruning
[10:00] Keep the Balance: A Parameter-Efficient Symmetrical Framework for RGB+X Semantic Segmentation
(ends 10:15 AM)
Orals 9:00-10:15
[9:00] Identifying and Mitigating Position Bias of Multi-image Vision-Language Models
[9:15] Mitigating Hallucinations in Large Vision-Language Models via DPO: On-Policy Data Hold the Key
[9:30] Q-Eval-100K: Evaluating Visual Quality and Alignment Level for Text-to-Vision Content
[9:45] Thinking in Space: How Multimodal Large Language Models See, Remember, and Recall Spaces
[10:00] From Multimodal LLMs to Generalist Embodied Agents: Methods and Lessons
(ends 10:15 AM)
10:30 a.m.
Posters 10:30-12:30
(ends 12:30 PM)
1 p.m.
Orals 1:00-2:15
[1:00] Reconstruction vs. Generation: Taming Optimization Dilemma in Latent Diffusion Models
[1:15] Language-Guided Image Tokenization for Generation
[1:30] DreamRelation: Bridging Customization and Relation Generation
[1:45] Infinity: Scaling Bitwise AutoRegressive Modeling for High-Resolution Image Synthesis
[2:00] Autoregressive Distillation of Diffusion Transformers
(ends 2:15 PM)
Orals 1:00-2:15
[1:00] PDFactor: Learning Tri-Perspective View Policy Diffusion Field for Multi-Task Robotic Manipulation
[1:15] RoboSpatial: Teaching Spatial Understanding to 2D and 3D Vision-Language Models for Robotics
[1:30] GROVE: A Generalized Reward for Learning Open-Vocabulary Physical Skill
[1:45] Navigation World Models
[2:00] Viewpoint Rosetta Stone: Unlocking Unpaired Ego-Exo Videos for View-invariant Representation Learning
(ends 2:15 PM)
Orals 1:00-2:15
[1:00] DORNet: A Degradation Oriented and Regularized Network for Blind Depth Super-Resolution
[1:15] Convex Relaxation for Robust Vanishing Point Estimation in Manhattan World
[1:30] Learned Binocular-Encoding Optics for RGBD Imaging Using Joint Stereo and Focus Cues
[1:45] Camera Resection from Known Line Pencils and a Radially Distorted Scanline
[2:00] Opportunistic Single-Photon Time of Flight
(ends 2:15 PM)
2:30 p.m.
Keynote: Laurens van der Maaten
(ends 3:30 PM)
5 p.m.
Posters 5:00-7:00
(ends 7:00 PM)

SUN 15 JUN
9 a.m.
Orals 9:00-10:15
[9:00] Improving Diffusion Inverse Problem Solving with Decoupled Noise Annealing
[9:15] DesignDiffusion: High-Quality Text-to-Design Image Generation with Diffusion Models
[9:30] CustAny: Customizing Anything from A Single Example
[9:45] Minority-Focused Text-to-Image Generation via Prompt Optimization
[10:00] Black-Box Forgery Attacks on Semantic Watermarks for Diffusion Models
(ends 10:15 AM)
Orals 9:00-10:15
[9:00] UniAP: Unifying Inter- and Intra-Layer Automatic Parallelism by Mixed Integer Quadratic Programming
[9:15] Geometric Knowledge-Guided Localized Global Distribution Alignment for Federated Learning
[9:30] Enhancing Diversity for Data-free Quantization
[9:45] TopoCellGen: Generating Histopathology Cell Topology with a Diffusion Model
[10:00] Enhancing SAM with Efficient Prompting and Preference Optimization for Semi-supervised Medical Image Segmentation
(ends 10:15 AM)
Orals 9:00-10:15
[9:00] Adv-CPG: A Customized Portrait Generation Framework with Facial Adversarial Attacks
[9:15] Gromov–Wasserstein Problem with Cyclic Symmetry
[9:30] Time of the Flight of the Gaussians: Optimizing Depth Indirectly in Dynamic Radiance Fields
[9:45] Zero-Shot Monocular Scene Flow Estimation in the Wild
[10:00] 3D Student Splatting and Scooping
(ends 10:15 AM)
10:30 a.m.
Posters 10:30-12:30
(ends 12:30 PM)
1 p.m.
Orals 1:00-2:15
[1:00] DIFIX3D+: Improving 3D Reconstructions with Single-Step Diffusion Models
[1:15] 3DGUT: Enabling Distorted Cameras and Secondary Rays in Gaussian Splatting
[1:30] DNF: Unconditional 4D Generation with Dictionary-based Neural Fields
[1:45] CAT4D: Create Anything in 4D with Multi-View Video Diffusion Models
[2:00] Diffusion Renderer: Neural Inverse and Forward Rendering with Video Diffusion Models
(ends 2:15 PM)
Orals 1:00-2:30
[1:00] Effective SAM Combination for Open-Vocabulary Semantic Segmentation
[1:15] FluidNexus: 3D Fluid Reconstruction and Prediction from a Single Video
[1:30] Birth and Death of a Rose
[1:45] Semi-Supervised State-Space Model with Dynamic Stacking Filter for Real-World Video Deraining
[2:00] AnyEdit: Mastering Unified High-Quality Image Editing for Any Idea
[2:15] Generative Multimodal Pretraining with Discrete Diffusion Timestep Tokens
(ends 2:30 PM)
Orals 1:00-2:30
[1:00] Seeing Far and Clearly: Mitigating Hallucinations in MLLMs with Attention Causal Decoding
[1:15] Video-XL: Extra-Long Vision Language Model for Hour-Scale Video Understanding
[1:30] LoRASculpt: Sculpting LoRA for Harmonizing General and Specialized Knowledge in Multimodal Large Language Models
[1:45] VideoEspresso: A Large-Scale Chain-of-Thought Dataset for Fine-Grained Video Reasoning via Core Frame Selection
[2:00] SEAL: Semantic Attention Learning for Long Video Representation
[2:15] Learning Audio-guided Video Representation with Gated Attention for Video-Text Retrieval
(ends 2:30 PM)
2:45 p.m.
Keynote: Carolina Parada
(ends 3:45 PM)
4 p.m.
Posters 4:00-6:00
(ends 6:00 PM)