Timezone: America/Chicago
FRI 13 JUN
9 a.m.
Orals 9:00-10:15
[9:00]
Motion Prompting: Controlling Video Generation with Motion Trajectories
[9:15]
Go-with-the-Flow: Motion-Controllable Video Diffusion Models Using Real-Time Warped Noise
[9:30]
LookingGlass: Generative Anamorphoses via Laplacian Pyramid Warping
[9:45]
Alias-Free Latent Diffusion Models: Improving Fractional Shift Equivariance of Diffusion Latent Space
[10:00]
RandAR: Decoder-only Autoregressive Visual Generation in Random Orders
(ends 10:15 AM)
Orals 9:00-10:15
[9:00]
OpenING: A Comprehensive Benchmark for Judging Open-ended Interleaved Image-Text Generation
[9:15]
LibraGrad: Balancing Gradient Flow for Universally Better Vision Transformer Attributions
[9:30]
Do We Always Need the Simplicity Bias? Looking for Optimal Inductive Biases in the Wild
[9:45]
Molmo and PixMo: Open Weights and Open Data for State-of-the-Art Vision-Language Models
[10:00]
Rethinking Vision-Language Model in Face Forensics: Multi-Modal Interpretable Forged Face Detector
(ends 10:15 AM)
Orals 9:00-10:15
[9:00]
CleanDIFT: Diffusion Features without Noise
[9:15]
OverLoCK: An Overview-first-Look-Closely-next ConvNet with Context-Mixing Dynamic Kernels
[9:30]
Towards Explicit Geometry-Reflectance Collaboration for Generalized LiDAR Segmentation in Adverse Weather
[9:45]
DiffFNO: Diffusion Fourier Neural Operator
[10:00]
Removing Reflections from RAW Photos
(ends 10:15 AM)
10:30 a.m.
Posters 10:30-12:30
FreeUV: Ground-Truth-Free Realistic Facial UV Texture Recovery via Cross-Assembly Inference Strategy
Rethinking Temporal Fusion with a Unified Gradient Descent View for 3D Semantic Occupancy Prediction
VideoICL: Confidence-based Iterative In-context Learning for Out-of-Distribution Video Understanding
(ends 12:30 PM)
1 p.m.
Orals 1:00-2:30
[1:00]
FoundationStereo: Zero-Shot Stereo Matching
[1:15]
MoGe: Unlocking Accurate Monocular Geometry Estimation for Open-Domain Images with Optimal Training Supervision
[1:30]
Multi-view Reconstruction via SfM-guided Monocular Depth Estimation
[1:45]
MV-DUSt3R+: Single-Stage Scene Reconstruction from Sparse Views In 2 Seconds
[2:00]
VGGT: Visual Geometry Grounded Transformer
[2:15]
CraftsMan3D: High-fidelity Mesh Generation with 3D Native Diffusion and Interactive Geometry Refiner
(ends 2:30 PM)
Orals 1:00-2:30
[1:00]
CAP4D: Creating Animatable 4D Portrait Avatars with Morphable Multi-View Diffusion Models
[1:15]
Reanimating Images using Neural Representations of Dynamic Stimuli
[1:30]
EgoLM: Multi-Modal Language Model of Egocentric Motions
[1:45]
Reconstructing Humans with a Biomechanically Accurate Skeleton
[2:00]
MEGA: Masked Generative Autoencoder for Human Mesh Recovery
[2:15]
TokenHSI: Unified Synthesis of Physical Human-Scene Interactions through Task Tokenization
(ends 2:30 PM)
Orals 1:00-2:30
[1:00]
Descriptor-In-Pixel: Point-Feature Tracking For Pixel Processor Arrays
[1:15]
Temporally Consistent Object-Centric Learning by Contrasting Slots
[1:30]
Temporal Alignment-Free Video Matching for Few-shot Action Recognition
[1:45]
Closed-Loop Supervised Fine-Tuning of Tokenized Traffic Models
[2:00]
The PanAf-FGBG Dataset: Understanding the Impact of Backgrounds in Wildlife Behaviour Recognition
[2:15]
Rethinking Spiking Self-Attention Mechanism: Implementing α-XNOR Similarity Calculation in Spiking Transformers
(ends 2:30 PM)
2:45 p.m.
4 p.m.
Posters 4:00-6:00
HiFi-Portrait: Zero-shot Identity-preserved Portrait Generation with High-fidelity Multi-face Fusion
CraftsMan3D: High-fidelity Mesh Generation with 3D Native Diffusion and Interactive Geometry Refiner
Common3D: Self-Supervised Learning of 3D Morphable Models for Common Objects in Neural Feature Space
Tra-MoE: Learning Trajectory Prediction Model from Multiple Domains for Adaptive Policy Conditioning
K-Sort Arena: Efficient and Reliable Benchmarking for Generative Models via K-wise Human Preferences
COUNTS: Benchmarking Object Detectors and Multimodal Large Language Models under Distribution Shifts
Augmenting Multimodal LLMs with Self-Reflective Tokens for Knowledge-based Visual Question Answering
SemiDAViL: Semi-supervised Domain Adaptation with Vision-Language Guidance for Semantic Segmentation
(ends 6:00 PM)
SAT 14 JUN
9 a.m.
Orals 9:00-10:15
[9:00]
MegaSaM: Accurate, Fast and Robust Structure and Motion from Casual Dynamic Videos
[9:15]
Stereo4D: Learning How Things Move in 3D from Internet Stereo Videos
[9:30]
Continuous 3D Perception Model with Persistent State
[9:45]
TacoDepth: Towards Efficient Radar-Camera Depth Estimation with One-stage Fusion
[10:00]
Neural Inverse Rendering from Propagating Light
(ends 10:15 AM)
Orals 9:00-10:15
[9:00]
SegEarth-OV: Towards Training-Free Open-Vocabulary Segmentation for Remote Sensing Images
[9:15]
Towards Universal Dataset Distillation via Task-Driven Diffusion
[9:30]
IceDiff: High Resolution and High-Quality Arctic Sea Ice Forecasting with Generative Diffusion Prior
[9:45]
Efficient Test-time Adaptive Object Detection via Sensitivity-Guided Pruning
[10:00]
Keep the Balance: A Parameter-Efficient Symmetrical Framework for RGB+X Semantic Segmentation
(ends 10:15 AM)
Orals 9:00-10:15
[9:00]
Identifying and Mitigating Position Bias of Multi-image Vision-Language Models
[9:15]
Mitigating Hallucinations in Large Vision-Language Models via DPO: On-Policy Data Hold the Key
[9:30]
Q-Eval-100K: Evaluating Visual Quality and Alignment Level for Text-to-Vision Content
[9:45]
Thinking in Space: How Multimodal Large Language Models See, Remember, and Recall Spaces
[10:00]
From Multimodal LLMs to Generalist Embodied Agents: Methods and Lessons
(ends 10:15 AM)
10:30 a.m.
Posters 10:30-12:30
HyperPose: Hypernetwork-Infused Camera Pose Localization and an Extended Cambridge Landmarks Dataset
IceDiff: High Resolution and High-Quality Arctic Sea Ice Forecasting with Generative Diffusion Prior
BlockDance: Reuse Structurally Similar Spatio-Temporal Features to Accelerate Diffusion Transformers
InstanceGaussian: Appearance-Semantic Joint Gaussian Representation for 3D Instance-Level Perception
(ends 12:30 PM)
1 p.m.
Orals 1:00-2:15
[1:00]
Reconstruction vs. Generation: Taming Optimization Dilemma in Latent Diffusion Models
[1:15]
Language-Guided Image Tokenization for Generation
[1:30]
DreamRelation: Bridging Customization and Relation Generation
[1:45]
Infinity: Scaling Bitwise AutoRegressive Modeling for High-Resolution Image Synthesis
[2:00]
Autoregressive Distillation of Diffusion Transformers
(ends 2:15 PM)
Orals 1:00-2:15
[1:00]
PDFactor: Learning Tri-Perspective View Policy Diffusion Field for Multi-Task Robotic Manipulation
[1:15]
RoboSpatial: Teaching Spatial Understanding to 2D and 3D Vision-Language Models for Robotics
[1:30]
GROVE: A Generalized Reward for Learning Open-Vocabulary Physical Skill
[1:45]
Navigation World Models
[2:00]
Viewpoint Rosetta Stone: Unlocking Unpaired Ego-Exo Videos for View-invariant Representation Learning
(ends 2:15 PM)
Orals 1:00-2:15
[1:00]
DORNet: A Degradation Oriented and Regularized Network for Blind Depth Super-Resolution
[1:15]
Convex Relaxation for Robust Vanishing Point Estimation in Manhattan World
[1:30]
Learned Binocular-Encoding Optics for RGBD Imaging Using Joint Stereo and Focus Cues
[1:45]
Camera Resection from Known Line Pencils and a Radially Distorted Scanline
[2:00]
Opportunistic Single-Photon Time of Flight
(ends 2:15 PM)
2:30 p.m.
5 p.m.
Posters 5:00-7:00
AuraFusion360: Augmented Unseen Region Alignment for Reference-based 360° Unbounded Scene Inpainting
GPVK-VL: Geometry-Preserving Virtual Keyframes for Visual Localization under Large Viewpoint Changes
GREAT: Geometry-Intention Collaborative Inference for Open-Vocabulary 3D Object Affordance Grounding
EnergyMoGen: Compositional Human Motion Generation with Energy-Based Diffusion Model in Latent Space
CAD-Llama: Leveraging Large Language Models for Computer-Aided Design Parametric 3D Model Generation
Do Computer Vision Foundation Models Learn the Low-level Characteristics of the Human Visual System?
(ends 7:00 PM)
SUN 15 JUN
9 a.m.
Orals 9:00-10:15
[9:00]
Improving Diffusion Inverse Problem Solving with Decoupled Noise Annealing
[9:15]
DesignDiffusion: High-Quality Text-to-Design Image Generation with Diffusion Models
[9:30]
CustAny: Customizing Anything from A Single Example
[9:45]
Minority-Focused Text-to-Image Generation via Prompt Optimization
[10:00]
Black-Box Forgery Attacks on Semantic Watermarks for Diffusion Models
(ends 10:15 AM)
Orals 9:00-10:15
[9:00]
UniAP: Unifying Inter- and Intra-Layer Automatic Parallelism by Mixed Integer Quadratic Programming
[9:15]
Geometric Knowledge-Guided Localized Global Distribution Alignment for Federated Learning
[9:30]
Enhancing Diversity for Data-free Quantization
[9:45]
TopoCellGen: Generating Histopathology Cell Topology with a Diffusion Model
[10:00]
Enhancing SAM with Efficient Prompting and Preference Optimization for Semi-supervised Medical Image Segmentation
(ends 10:15 AM)
Orals 9:00-10:15
[9:00]
Adv-CPG: A Customized Portrait Generation Framework with Facial Adversarial Attacks
[9:15]
Gromov–Wasserstein Problem with Cyclic Symmetry
[9:30]
Time of the Flight of the Gaussians: Optimizing Depth Indirectly in Dynamic Radiance Fields
[9:45]
Zero-Shot Monocular Scene Flow Estimation in the Wild
[10:00]
3D Student Splatting and Scooping
(ends 10:15 AM)
10:30 a.m.
Posters 10:30-12:30
GeoAvatar: Geometrically-Consistent Multi-Person Avatar Reconstruction from Sparse Multi-View Videos
Edit Away and My Face Will not Stay: Personal Biometric Defense against Malicious Generative Editing
(ends 12:30 PM)
1 p.m.
Orals 1:00-2:15
[1:00]
DIFIX3D+: Improving 3D Reconstructions with Single-Step Diffusion Models
[1:15]
3DGUT: Enabling Distorted Cameras and Secondary Rays in Gaussian Splatting
[1:30]
DNF: Unconditional 4D Generation with Dictionary-based Neural Fields
[1:45]
CAT4D: Create Anything in 4D with Multi-View Video Diffusion Models
[2:00]
Diffusion Renderer: Neural Inverse and Forward Rendering with Video Diffusion Models
(ends 2:15 PM)
Orals 1:00-2:30
[1:00]
Effective SAM Combination for Open-Vocabulary Semantic Segmentation
[1:15]
FluidNexus: 3D Fluid Reconstruction and Prediction from a Single Video
[1:30]
Birth and Death of a Rose
[1:45]
Semi-Supervised State-Space Model with Dynamic Stacking Filter for Real-World Video Deraining
[2:00]
AnyEdit: Mastering Unified High-Quality Image Editing for Any Idea
[2:15]
Generative Multimodal Pretraining with Discrete Diffusion Timestep Tokens
(ends 2:30 PM)
Orals 1:00-2:30
[1:00]
Seeing Far and Clearly: Mitigating Hallucinations in MLLMs with Attention Causal Decoding
[1:15]
Video-XL: Extra-Long Vision Language Model for Hour-Scale Video Understanding
[1:30]
LoRASculpt: Sculpting LoRA for Harmonizing General and Specialized Knowledge in Multimodal Large Language Models
[1:45]
VideoEspresso: A Large-Scale Chain-of-Thought Dataset for Fine-Grained Video Reasoning via Core Frame Selection
[2:00]
SEAL: Semantic Attention Learning for Long Video Representation
[2:15]
Learning Audio-guided Video Representation with Gated Attention for Video-Text Retrieval
(ends 2:30 PM)
2:45 p.m.
4 p.m.
Posters 4:00-6:00
Neural LightRig: Unlocking Accurate Object Normal and Material Estimation with Multi-Light Diffusion
IDEA: Inverted Text with Cooperative Deformable Aggregation for Multi-modal Object Re-Identification
(ends 6:00 PM)