CVPR 2025 Events with Videos
Keynotes
Posters
- ParaHome: Parameterizing Everyday Home Activities Towards 3D Generative Modeling of Human-Object Interactions
- PrEditor3D: Fast and Precise 3D Shape Editing
- VinaBench: Benchmark for Faithful and Consistent Visual Narratives
- Test-Time Fine-Tuning of Image Compression Models for Multi-Task Adaptability
- LT3SD: Latent Trees for 3D Scene Diffusion
- SynthLight: Portrait Relighting with Diffusion Model by Learning to Re-render Synthetic Faces
- Point-Cache: Test-time Dynamic and Hierarchical Cache for Robust and Generalizable Point Cloud Analysis
- DexHandDiff: Interaction-aware Diffusion Planning for Adaptive Dexterous Manipulation
- EZSR: Event-based Zero-Shot Recognition
- HiMoR: Monocular Deformable Gaussian Reconstruction with Hierarchical Motion Representation
- Memories of Forgotten Concepts
- Discrete to Continuous: Generating Smooth Transition Poses from Sign Language Observations
- Camouflage Anything: Learning to Hide using Controlled Out-painting and Representation Engineering
- Synthetic Data is an Elegant GIFT for Continual Vision-Language Models
- Polarized Color Screen Matting
- Improve Representation for Imbalanced Regression through Geometric Constraints
- Differentiable Inverse Rendering with Interpretable Basis BRDFs
- Prior-free 3D Object Tracking
- Balanced Rate-Distortion Optimization in Learned Image Compression
- FRAMES-VQA: Benchmarking Fine-Tuning Robustness across Multi-Modal Shifts in Visual Question Answering
- Motion Modes: What Could Happen Next?
- Simplification Is All You Need against Out-of-Distribution Overconfidence
- Query Efficient Black-Box Visual Prompting with Subspace Learning
- Context-Aware Multimodal Pretraining
- ReconDreamer: Crafting World Models for Driving Scene Reconstruction via Online Restoration
- Binarized Neural Network for Multi-spectral Image Fusion
- Any3DIS: Class-Agnostic 3D Instance Segmentation by 2D Mask Tracking
- Attention IoU: Examining Biases in CelebA using Attention Maps
- VISTA: Enhancing Long-Duration and High-Resolution Video Understanding by Video Spatiotemporal Augmentation
- DeepLA-Net: Very Deep Local Aggregation Networks for Point Cloud Analysis
- MambaVO: Deep Visual Odometry Based on Sequential Matching Refinement and Training Smoothing
- KVQ: Boosting Video Quality Assessment via Saliency-guided Local Perception
- SceneFactor: Factored Latent 3D Diffusion for Controllable 3D Scene Generation
- Fortifying Federated Learning Towards Trustworthiness via Auditable Data Valuation and Verifiable Client Contribution
- MAD: Memory-Augmented Detection of 3D Objects
- Seeing More with Less: Human-like Representations in Vision Models
- One Diffusion to Generate Them All
- Electromyography-Informed Facial Expression Reconstruction for Physiological-Based Synthesis and Analysis
- iSegMan: Interactive Segment-and-Manipulate 3D Gaussians
- GASP: Gaussian Avatars with Synthetic Priors
- PreciseCam: Precise Camera Control for Text-to-Image Generation
- HumanMM: Global Human Motion Recovery from Multi-shot Videos
- Quaffure: Real-Time Quasi-Static Neural Hair Simulation
- PEACE: Empowering Geologic Map Holistic Understanding with MLLMs
- FinePhys: Fine-grained Human Action Generation by Explicitly Incorporating Physical Laws for Effective Skeletal Guidance
- Illumination Spectrum Estimation for Multispectral Images via Surface Reflectance Modeling and Spatial-Spectral Feature Generation
- Towards Source-Free Machine Unlearning
- RELOCATE: A Simple Training-Free Baseline for Visual Query Localization Using Region-Based Representations
- TexGaussian: Generating High-quality PBR Material via Octree-based 3D Gaussian Splatting
- SeqAfford: Sequential 3D Affordance Reasoning via Multimodal Large Language Model
- EquiPose: Exploiting Permutation Equivariance for Relative Camera Pose Estimation
- PanDA: Towards Panoramic Depth Anything with Unlabeled Panoramas and Mobius Spatial Augmentation
- BlueLM-V-3B: Algorithm and System Co-Design for Multimodal Large Language Models on Mobile Devices
- CaMuViD: Calibration-Free Multi-View Detection
- ChatGarment: Garment Estimation, Generation and Editing via Large Language Models
- Image Generation Diversity Issues and How to Tame Them
- GFlowVLM: Enhancing Multi-step Reasoning in Vision-Language Models with Generative Flow Networks
- Joint Vision-Language Social Bias Removal for CLIP
- AnomalyNCD: Towards Novel Anomaly Class Discovery in Industrial Scenarios
- LinGen: Towards High-Resolution Minute-Length Text-to-Video Generation with Linear Computational Complexity
- AI-Face: A Million-Scale Demographically Annotated AI-Generated Face Dataset and Fairness Benchmark
- Augmenting Perceptual Super-Resolution via Image Quality Predictors
- FDS: Frequency-Aware Denoising Score for Text-Guided Latent Diffusion Image Editing
- Wav2Sem: Plug-and-Play Audio Semantic Decoupling for 3D Speech-Driven Facial Animation
- FLAME: Frozen Large Language Models Enable Data-Efficient Language-Image Pre-training
- CALICO: Part-Focused Semantic Co-Segmentation with Large Vision-Language Models
- Adaptive Non-Uniform Timestep Sampling for Accelerating Diffusion Model Training
- Neural Hierarchical Decomposition for Single Image Plant Modeling
- MeshArt: Generating Articulated Meshes with Structure-Guided Transformers
- Prometheus: 3D-Aware Latent Diffusion Models for Feed-Forward Text-to-3D Scene Generation
- Decoupling Fine Detail and Global Geometry for Compressed Depth Map Super-Resolution
- NTR-Gaussian: Nighttime Dynamic Thermal Reconstruction with 4D Gaussian Splatting Based on Thermodynamics
- Provoking Multi-modal Few-Shot LVLM via Exploration-Exploitation In-Context Learning
- EnvPoser: Environment-aware Realistic Human Motion Estimation from Sparse Observations with Uncertainty Modeling
- Blurry-Edges: Photon-Limited Depth Estimation from Defocused Boundaries
- HybridGS: Decoupling Transients and Statics with 2D and 3D Gaussian Splatting
- Conditional Balance: Improving Multi-Conditioning Trade-Offs in Image Generation
- Stochastic Human Motion Prediction with Memory of Action Transition and Action Characteristic
- SGC-Net: Stratified Granular Comparison Network for Open-Vocabulary HOI Detection
- VideoTree: Adaptive Tree-based Video Representation for LLM Reasoning on Long Videos
- SimLTD: Simple Supervised and Semi-Supervised Long-Tailed Object Detection
- Classifier-guided CLIP Distillation for Unsupervised Multi-label Classification
- AG-VPReID: A Challenging Large-Scale Benchmark for Aerial-Ground Video-based Person Re-Identification
- STPro: Spatial and Temporal Progressive Learning for Weakly Supervised Spatio-Temporal Grounding
- Towards Satellite Image Road Graph Extraction: A Global-Scale Dataset and A Novel Method
- SAMBLE: Shape-Specific Point Cloud Sampling for an Optimal Trade-Off Between Local Detail and Global Uniformity
- Cross-Modal 3D Representation with Multi-View Images and Point Clouds
- GraphMimic: Graph-to-Graphs Generative Modeling from Videos for Policy Learning
- Molmo and PixMo: Open Weights and Open Data for State-of-the-Art Vision-Language Models
- Nonisotropic Gaussian Diffusion for Realistic 3D Human Motion Prediction
- VASparse: Towards Efficient Visual Hallucination Mitigation via Visual-Aware Token Sparsification
- High Temporal Consistency through Semantic Similarity Propagation in Semi-Supervised Video Semantic Segmentation for Autonomous Flight
- Deep Fair Multi-View Clustering with Attention KAN
- Learning Extremely High Density Crowds as Active Matters
- RipVIS: Rip Currents Video Instance Segmentation Benchmark for Beach Monitoring and Safety
- Dual Diffusion for Unified Image Generation and Understanding
- Scaling up Image Segmentation across Data and Tasks
- Curriculum Direct Preference Optimization for Diffusion and Consistency Models
- EnliveningGS: Active Locomotion of 3DGS
- HotSpot: Signed Distance Function Optimization with an Asymptotically Sufficient Condition
- Compass Control: Multi Object Orientation Control for Text-to-Image Generation
- SeedVR: Seeding Infinity in Diffusion Transformer Towards Generic Video Restoration
- Rethinking Temporal Fusion with a Unified Gradient Descent View for 3D Semantic Occupancy Prediction
- Associative Transformer
- Zero-Shot Image Restoration Using Few-Step Guidance of Consistency Models (and Beyond)
- MVPaint: Synchronized Multi-View Diffusion for Painting Anything 3D
- PICO: Reconstructing 3D People In Contact with Objects
- MUSt3R: Multi-view Network for Stereo 3D Reconstruction
- SketchFusion: Learning Universal Sketch Features through Fusing Foundation Models
- Continuous Space-Time Video Resampling with Invertible Motion Steganography
- LineArt: A Knowledge-guided Training-free High-quality Appearance Transfer for Design Drawing with Diffusion Model
- Structure-from-Motion with a Non-Parametric Camera Model
- Towards Improved Text-Aligned Codebook Learning: Multi-Hierarchical Codebook-Text Alignment with Long Text
- Accurate Differential Operators for Hybrid Neural Fields
- InsightEdit: Towards Better Instruction Following for Image Editing
- PromptHMR: Promptable Human Mesh Recovery
- A New Statistical Model of Star Speckles for Learning to Detect and Characterize Exoplanets in Direct Imaging Observations
- SAMWISE: Infusing Wisdom in SAM2 for Text-Driven Video Segmentation
- Towards High-fidelity 3D Talking Avatar with Personalized Dynamic Texture
- VideoGuide: Improving Video Diffusion Models without Training Through a Teacher's Guide
- Towards Explicit Geometry-Reflectance Collaboration for Generalized LiDAR Segmentation in Adverse Weather
- Large Self-Supervised Models Bridge the Gap in Domain Adaptive Object Detection
- Gaussian Splatting Feature Fields for (Privacy-Preserving) Visual Localization
- Hardware-Rasterized Ray-Based Gaussian Splatting
- PIAD: Pose and Illumination agnostic Anomaly Detection
- AeSPa: Attention-guided Self-supervised Parallel Imaging for MRI Reconstruction
- Task-aware Cross-modal Feature Refinement Transformer with Large Language Models for Visual Grounding
- SinGS: Animatable Single-Image Human Gaussian Splats with Kinematic Priors
- SoundVista: Novel-View Ambient Sound Synthesis via Visual-Acoustic Binding
- Neuro-Symbolic Evaluation of Text-to-Video Models using Formal Verification
- Efficient Personalization of Quantized Diffusion Model without Backpropagation
- Code-as-Monitor: Constraint-aware Visual Programming for Reactive and Proactive Robotic Failure Detection
- AdaCM^2: On Understanding Extremely Long-Term Video with Adaptive Cross-Modality Memory Reduction
- Multitwine: Multi-Object Compositing with Text and Layout Control
- ProtoDepth: Unsupervised Continual Depth Completion with Prototypes
- Robust Multimodal Survival Prediction with Conditional Latent Differentiation Variational AutoEncoder
- Monocular and Generalizable Gaussian Talking Head Animation
- Enhancing Facial Privacy Protection via Weakening Diffusion Purification
- MonSter: Marry Monodepth to Stereo Unleashes Power
- Open-World Amodal Appearance Completion
- VERA: Explainable Video Anomaly Detection via Verbalized Learning of Vision-Language Models
- HOTFormerLoc: Hierarchical Octree Transformer for Versatile Lidar Place Recognition Across Ground and Aerial Views
- Enhancing Privacy-Utility Trade-offs to Mitigate Memorization in Diffusion Models
- Stable Flow: Vital Layers for Training-Free Image Editing
- VGGT: Visual Geometry Grounded Transformer
- Token Cropr: Faster ViTs for Quite a Few Tasks
- AffordDP: Generalizable Diffusion Policy with Transferable Affordance
- Taming Teacher Forcing for Masked Autoregressive Video Generation
- Vid2Avatar-Pro: Authentic Avatar from Videos in the Wild via Universal Prior
- FreqDebias: Towards Generalizable Deepfake Detection via Consistency-Driven Frequency Debiasing
- ArcPro: Architectural Programs for Structured 3D Abstraction of Sparse Points
- PatchDEMUX: A Certifiably Robust Framework for Multi-label Classifiers Against Adversarial Patches
- Multirate Neural Image Compression with Adaptive Lattice Vector Quantization
- EffiDec3D: An Optimized Decoder for High-Performance and Efficient 3D Medical Image Segmentation
- Explainable Saliency: Articulating Reasoning with Contextual Prioritization
- MambaVLT: Time-Evolving Multimodal State Space Model for Vision-Language Tracking
- ERUPT: Efficient Rendering with Unposed Patch Transformer
- SAIST: Segment Any Infrared Small Target Model Guided by Contrastive Language-Image Pretraining
- HOT3D: Hand and Object Tracking in 3D from Egocentric Multi-View Videos
- ScribbleLight: Single Image Indoor Relighting with Scribbles
- HyperGS: Hyperspectral 3D Gaussian Splatting
- From Laboratory to Real World: A New Benchmark Towards Privacy-Preserved Visible-Infrared Person Re-Identification
- Multi-party Collaborative Attention Control for Image Customization
- CrossOver: 3D Scene Cross-Modal Alignment
- Charm: The Missing Piece in ViT Fine-Tuning for Image Aesthetic Assessment
- RASP: Revisiting 3D Anamorphic Art for Shadow-Guided Packing of Irregular Objects
- From Head to Tail: Towards Balanced Representation in Large Vision-Language Models through Adaptive Data Calibration
- 3D-GSW: 3D Gaussian Splatting for Robust Watermarking
- Layered Image Vectorization via Semantic Simplification
- Feat2GS: Probing Visual Foundation Models with Gaussian Splatting
- Doppelgängers and Adversarial Vulnerability
- Gaussian Splatting for Efficient Satellite Image Photogrammetry
- MeshGen: Generating PBR Textured Mesh with Render-Enhanced Auto-Encoder and Generative Data Augmentation
- CholecTrack20: A Multi-Perspective Tracking Dataset for Surgical Tools
- EasyHOI: Unleashing the Power of Large Models for Reconstructing Hand-Object Interactions in the Wild
- Dual Exposure Stereo for Extended Dynamic Range 3D Imaging
- Hypergraph Vision Transformers: Images are More than Nodes, More than Edges
- FG^2: Fine-Grained Cross-View Localization by Fine-Grained Feature Matching
- The Language of Motion: Unifying Verbal and Non-verbal Language of 3D Human Motion
- MVGenMaster: Scaling Multi-View Generation from Any Image via 3D Priors Enhanced Diffusion Model
- LoTUS: Large-Scale Machine Unlearning with a Taste of Uncertainty
- MonoPlace3D: Learning 3D-Aware Object Placement for 3D Monocular Detection
- ConMo: Controllable Motion Disentanglement and Recomposition for Zero-Shot Motion Transfer
- Lifting the Veil on Visual Information Flow in MLLMs: Unlocking Pathways to Faster Inference
- PUP 3D-GS: Principled Uncertainty Pruning for 3D Gaussian Splatting
- Driving by the Rules: A Benchmark for Integrating Traffic Sign Regulations into Vectorized HD Map
- PartGen: Part-level 3D Generation and Reconstruction with Multi-view Diffusion Models
- Resilient Sensor Fusion Under Adverse Sensor Failures via Multi-Modal Expert Fusion
- One-Step Event-Driven High-Speed Autofocus
- Kiss3DGen: Repurposing Image Diffusion Models for 3D Asset Generation
- VISTREAM: Improving Computation Efficiency of Visual Streaming Perception via Law-of-Charge-Conservation Inspired Spiking Neural Network
- Context-Enhanced Memory-Refined Transformer for Online Action Detection
- Pathways on the Image Manifold: Image Editing via Video Generation
- ESCAPE: Equivariant Shape Completion via Anchor Point Encoding
- OSLoPrompt: Bridging Low-Supervision Challenges and Open-Set Domain Generalization in CLIP
- Efficient Depth Estimation for Unstable Stereo Camera Systems on AR Glasses
- Benchmarking Object Detectors under Real-World Distribution Shifts in Satellite Imagery
- JiSAM: Alleviate Labeling Burden and Corner Case Problems in Autonomous Driving via Minimal Real-World Data
- Revisiting Audio-Visual Segmentation with Vision-Centric Transformer
- Unboxed: Geometrically and Temporally Consistent Video Outpainting
- Binarized Mamba-Transformer for Lightweight Quad Bayer HybridEVS Demosaicing
- Gazing Into Missteps: Leveraging Eye-Gaze for Unsupervised Mistake Detection in Egocentric Videos of Skilled Human Activities
- Generative Photomontage
- ANNEXE: Unified Analyzing, Answering, and Pixel Grounding for Egocentric Interaction
- Solving Instance Detection from an Open-World Perspective
- Vision-Language Gradient Descent-driven All-in-One Deep Unfolding Networks
- HuMoCon: Concept Discovery for Human Motion Understanding
- Minding Fuzzy Regions: A Data-driven Alternating Learning Paradigm for Stable Lesion Segmentation
- Gradient Inversion Attacks on Parameter-Efficient Fine-Tuning
- Variance-Based Membership Inference Attacks Against Large-Scale Image Captioning Models
- CLIP Under the Microscope: A Fine-Grained Analysis of Multi-Object Representation
- PSA-SSL: Pose and Size-aware Self-Supervised Learning on LiDAR Point Clouds
- MAGiC-SLAM: Multi-Agent Gaussian Globally Consistent SLAM
- MAtCha Gaussians: Atlas of Charts for High-Quality Geometry and Photorealism From Sparse Views
- Rethinking Spiking Self-Attention Mechanism: Implementing α-XNOR Similarity Calculation in Spiking Transformers
- Medusa: A Multi-Scale High-order Contrastive Dual-Diffusion Approach for Multi-View Clustering
- Disentangling Safe and Unsafe Image Corruptions via Anisotropy and Locality
- HVI: A New Color Space for Low-light Image Enhancement
- KeyFace: Expressive Audio-Driven Facial Animation for Long Sequences via KeyFrame Interpolation
- BioX-CPath: Biologically-driven Explainable Diagnostics for Multistain IHC Computational Pathology
- Towards Explainable and Unprecedented Accuracy in Matching Challenging Finger Crease Patterns
- Generating Multimodal Driving Scenes via Next-Scene Prediction
- 4Deform: Neural Surface Deformation for Robust Shape Interpolation
- A Tale of Two Classes: Adapting Supervised Contrastive Learning to Binary Imbalanced Datasets
- HierarQ: Task-Aware Hierarchical Q-Former for Enhanced Video Understanding
- Rethinking Noisy Video-Text Retrieval via Relation-aware Alignment
- UNEM: UNrolled Generalized EM for Transductive Few-Shot Learning
- Lost in Translation, Found in Context: Sign Language Translation with Contextual Cues
- LOCORE: Image Re-ranking with Long-Context Sequence Modeling
- Self-Expansion of Pre-trained Models with Mixture of Adapters for Continual Learning
- Multi-subject Open-set Personalization in Video Generation
- Lux Post Facto: Learning Portrait Performance Relighting with Conditional Video Diffusion and a Hybrid Dataset
- Geometry Field Splatting with Gaussian Surfels
- RoboPEPP: Vision-Based Robot Pose and Joint Angle Estimation through Embedding Predictive Pre-Training
- RobSense: A Robust Multi-modal Foundation Model for Remote Sensing with Static, Temporal, and Incomplete Data Adaptability
- SATA: Spatial Autocorrelation Token Analysis for Enhancing the Robustness of Vision Transformers
- Video Depth without Video Models
- UMotion: Uncertainty-driven Human Motion Estimation from Inertial and Ultra-wideband Units
- Dissecting and Mitigating Diffusion Bias via Mechanistic Interpretability
- Do Your Best and Get Enough Rest for Continual Learning
- Mind the Gap: Detecting Black-box Adversarial Attacks in the Making through Query Update Analysis
- HOIGPT: Learning Long-Sequence Hand-Object Interaction with Language Models
- EDM: Equirectangular Projection-Oriented Dense Kernelized Feature Matching
- Descriptor-In-Pixel: Point-Feature Tracking For Pixel Processor Arrays
- VerbDiff: Text-Only Diffusion Models with Enhanced Interaction Awareness
- T2V-CompBench: A Comprehensive Benchmark for Compositional Text-to-video Generation
- Towards Cost-Effective Learning: A Synergy of Semi-Supervised and Active Learning
- Advancing Manga Analysis: Comprehensive Segmentation Annotations for the Manga109 Dataset
- Reducing Class-wise Confusion for Incremental Learning with Disentangled Manifolds
- Vision-Language Model IP Protection via Prompt-based Learning
- Decompositional Neural Scene Reconstruction with Generative Diffusion Prior
- LLaVA-ST: A Multimodal Large Language Model for Fine-Grained Spatial-Temporal Understanding
- How Do I Do That? Synthesizing 3D Hand Motion and Contacts for Everyday Interactions
- Temporal Alignment-Free Video Matching for Few-shot Action Recognition
- Graph Neural Network Combining Event Stream and Periodic Aggregation for Low-Latency Event-based Vision
- Exploiting Deblurring Networks for Radiance Fields
- EBS-EKF: Accurate and High Frequency Event-based Star Tracking
- GAF: Gaussian Avatar Reconstruction from Monocular Videos via Multi-view Diffusion
- CH3Depth: Efficient and Flexible Depth Foundation Model with Flow Matching
- Common3D: Self-Supervised Learning of 3D Morphable Models for Common Objects in Neural Feature Space
- RigGS: Rigging of 3D Gaussians for Modeling Articulated Objects in Videos
- OSMamba: Omnidirectional Spectral Mamba with Dual-Domain Prior Generator for Exposure Correction
- ForestLPR: LiDAR Place Recognition in Forests Attentioning Multiple BEV Density Images
- Distraction is All You Need for Multimodal Large Language Model Jailbreaking
- Generative Gaussian Splatting for Unbounded 3D City Generation
- GarmentPile: Point-Level Visual Affordance Guided Retrieval and Adaptation for Cluttered Garments Manipulation
- Seeing the Abstract: Translating the Abstract Language for Vision Language Models
- Tiled Diffusion
- Dual-Agent Optimization framework for Cross-Domain Few-Shot Segmentation
- PACT: Pruning and Clustering-Based Token Reduction for Faster Visual Language Models
- BiomedCoOp: Learning to Prompt for Biomedical Vision-Language Models
- UnCommon Objects in 3D
- EditSplat: Multi-View Fusion and Attention-Guided Optimization for View-Consistent 3D Scene Editing with 3D Gaussian Splatting
- CaricatureBooth: Data-Free Interactive Caricature Generation in a Photo Booth
- SemanticDraw: Towards Real-Time Interactive Content Creation from Image Diffusion Models
- Robotic Visual Instruction
- An Image-like Diffusion Method for Human-Object Interaction Detection
- PolarFree: Polarization-based Reflection-Free Imaging
- Harnessing Global-Local Collaborative Adversarial Perturbation for Anti-Customization
- Generative Omnimatte: Learning to Decompose Video into Layers
- Recovering Dynamic 3D Sketches from Videos
- Towards Long-Horizon Vision-Language Navigation: Platform, Benchmark and Method
- Pixel-aligned RGB-NIR Stereo Imaging and Dataset for Robot Vision
- FSBench: A Figure Skating Benchmark for Advancing Artistic Sports Understanding
- Hyperbolic Uncertainty-Aware Few-Shot Incremental Point Cloud Segmentation
- Physical Plausibility-aware Trajectory Prediction via Locomotion Embodiment
- FedCS: Coreset Selection for Federated Learning
- EarthDial: Turning Multi-sensory Earth Observations to Interactive Dialogues
- TaoAvatar: Real-Time Lifelike Full-Body Talking Avatars for Augmented Reality via 3D Gaussian Splatting
- Automatic Joint Structured Pruning and Quantization for Efficient Neural Network Training and Compression
- Joint Optimization of Neural Radiance Fields and Continuous Camera Motion from a Monocular Video
- Make-It-Animatable: An Efficient Framework for Authoring Animation-Ready 3D Characters
- Efficient Test-time Adaptive Object Detection via Sensitivity-Guided Pruning
- Conformal Prediction and MLLM aided Uncertainty Quantification in Scene Graph Generation
- Complementary Advantages: Exploiting Cross-Field Frequency Correlation for NIR-Assisted Image Denoising
- R-SCoRe: Revisiting Scene Coordinate Regression for Robust Large-Scale Visual Localization
- T-CIL: Temperature Scaling using Adversarial Perturbation for Calibration in Class-Incremental Learning
- DSPNet: Dual-vision Scene Perception for Robust 3D Question Answering
- PerLA: Perceptive 3D Language Assistant
- Compositional Caching for Training-free Open-vocabulary Attribute Detection
- PanSplat: 4K Panorama Synthesis with Feed-Forward Gaussian Splatting
- ClearSight: Visual Signal Enhancement for Object Hallucination Mitigation in Multimodal Large Language Models
- SCSA: A Plug-and-Play Semantic Continuous-Sparse Attention for Arbitrary Semantic Style Transfer
- GENMANIP: LLM-driven Simulation for Generalizable Instruction-Following Manipulation
- VinTAGe: Joint Video and Text Conditioning for Holistic Audio Generation
- AvatarArtist: Open-Domain 4D Avatarization
- Few-shot Personalized Scanpath Prediction
- SegEarth-OV: Towards Training-Free Open-Vocabulary Segmentation for Remote Sensing Images
- Masked Point-Entity Contrast for Open-Vocabulary 3D Scene Understanding
- Towards Universal Dataset Distillation via Task-Driven Diffusion
- Efficient Event-Based Object Detection: A Hybrid Neural Network with Spatial and Temporal Attention
- SimLingo: Vision-Only Closed-Loop Autonomous Driving with Language-Action Alignment
- WildGS-SLAM: Monocular Gaussian Splatting SLAM in Dynamic Environments
- Spotting the Unexpected (STU): A 3D LiDAR Dataset for Anomaly Segmentation in Autonomous Driving
- Spectral Informed Mamba for Robust Point Cloud Processing
- Co-op: Correspondence-based Novel Object Pose Estimation
- RePerformer: Immersive Human-centric Volumetric Videos from Playback to Photoreal Reperformance
- Enhancing 3D Gaze Estimation in the Wild using Weak Supervision with Gaze Following Labels
- Seeing is Not Believing: Adversarial Natural Object Optimization for Hard-Label 3D Scene Attacks
- Full-DoF Egomotion Estimation for Event Cameras Using Geometric Solvers
- Dynamic Camera Poses and Where to Find Them
- LoRACLR: Contrastive Adaptation for Customization of Diffusion Models
- No Pains, More Gains: Recycling Sub-Salient Patches for Efficient High-Resolution Image Recognition
- Feature4X: Bridging Any Monocular Video to 4D Agentic AI with Versatile Gaussian Feature Fields
- Arc2Avatar: Generating Expressive 3D Avatars from a Single Image via ID Guidance
- EVolSplat: Efficient Volume-based Gaussian Splatting for Urban View Synthesis
- PoseBH: Prototypical Multi-Dataset Training Beyond Human Pose Estimation
- QuCOOP: A Versatile Framework for Solving Composite and Binary-Parametrised Problems on Quantum Annealers
- Noise-Consistent Siamese-Diffusion for Medical Image Synthesis and Segmentation
- Classifier-Free Guidance Inside the Attraction Basin May Cause Memorization
- Overcoming Shortcut Problem in VLM for Robust Out-of-Distribution Detection
- TacoDepth: Towards Efficient Radar-Camera Depth Estimation with One-stage Fusion
- SOGS: Second-Order Anchor for Advanced 3D Gaussian Splatting
- Volumetrically Consistent 3D Gaussian Rasterization
- Mitigating Hallucinations in Large Vision-Language Models via DPO: On-Policy Data Hold the Key
- Masked Scene Modeling: Narrowing the Gap Between Supervised and Self-Supervised Learning in 3D Scene Understanding
- Coherent 3D Portrait Video Reconstruction via Triplane Fusion
- DiffLocks: Generating 3D Hair from a Single Image using Diffusion Models
- Dr. Splat: Directly Referring 3D Gaussian Splatting via Direct Language Embedding Registration
- Seeing A 3D World in A Grain of Sand
- MarkushGrapher: Joint Visual and Textual Recognition of Markush Structures
- CRISP: Object Pose and Shape Estimation with Test-Time Adaptation
- Vision-Guided Action: Enhancing 3D Human Motion Prediction with Gaze-informed Affordance in 3D Scenes
- HumanDreamer: Generating Controllable Human-Motion Videos via Decoupled Generation
- UniVAD: A Training-free Unified Model for Few-shot Visual Anomaly Detection
- InsTaG: Learning Personalized 3D Talking Head from Few-Second Video
- ArtiFade: Learning to Generate High-quality Subject from Blemished Images
- DTOS: Dynamic Time Object Sensing with Large Multimodal Model
- Geometry-guided Online 3D Video Synthesis with Multi-View Temporal Consistency
- Question-Aware Gaussian Experts for Audio-Visual Question Answering
- Efficient Decoupled Feature 3D Gaussian Splatting via Hierarchical Compression
- FALCON: Fairness Learning via Contrastive Attention Approach to Continual Semantic Scene Understanding
- MoDec-GS: Global-to-Local Motion Decomposition and Temporal Interval Adjustment for Compact Dynamic 3D Gaussian Splatting
- Focusing on Tracks for Online Multi-Object Tracking
- DIFFER: Disentangling Identity Features via Semantic Cues for Clothes-Changing Person Re-ID
- FruitNinja: 3D Object Interior Texture Generation with Gaussian Splatting
- Is 'Right' Right? Enhancing Object Orientation Understanding in Multimodal Large Language Models through Egocentric Instruction Tuning
- DyMO: Training-Free Diffusion Model Alignment with Dynamic Multi-Objective Scheduling
- CASP: Consistency-aware Audio-induced Saliency Prediction Model for Omnidirectional Video
- Parallelized Autoregressive Visual Generation
- CAP-Net: A Unified Network for 6D Pose and Size Estimation of Categorical Articulated Parts from a Single RGB-D Image
- Distilling Spectral Graph for Object-Context Aware Open-Vocabulary Semantic Segmentation
- STiL: Semi-supervised Tabular-Image Learning for Comprehensive Task-Relevant Information Exploration in Multimodal Classification
- MammAlps: A Multi-view Video Behavior Monitoring Dataset of Wild Mammals in the Swiss Alps
- Decoupled Motion Expression Video Segmentation
- MV-SSM: Multi-View State Space Modeling for 3D Human Pose Estimation
- FineLIP: Extending CLIP’s Reach via Fine-Grained Alignment with Longer Text Inputs
- DecoupledGaussian: Object-Scene Decoupling for Physics-Based Interaction
- Identifying and Mitigating Position Bias of Multi-image Vision-Language Models
- Thin-Shell-SfT: Fine-Grained Monocular Non-rigid 3D Surface Tracking with Neural Deformation Fields
- GCC: Generative Color Constancy via Diffusing a Color Checker
- Semantic-guided Cross-Modal Prompt Learning for Skeleton-based Zero-shot Action Recognition
- DeCLIP: Decoupled Learning for Open-Vocabulary Dense Perception
- A Unified Framework for Heterogeneous Semi-supervised Learning
- ViKIENet: Towards Efficient 3D Object Detection with Virtual Key Instance Enhanced Network
- FluxSpace: Disentangled Semantic Editing in Rectified Flow Models
- Image Reconstruction from Readout-Multiplexed Single-Photon Detector Arrays
- MEAT: Multiview Diffusion Model for Human Generation on Megapixels with Mesh Attention
- Generative Map Priors for Collaborative BEV Semantic Segmentation
- Dynamic Stereotype Theory Induced Micro-expression Recognition with Oriented Deformation
- Gazing at Rewards: Eye Movements as a Lens into Human and AI Decision-Making in Hybrid Visual Foraging
- InterMimic: Towards Universal Whole-Body Control for Physics-Based Human-Object Interactions
- MPDrive: Improving Spatial Understanding with Marker-Based Prompt Learning for Autonomous Driving
- Explaining in Diffusion: Explaining a Classifier with Diffusion Semantics
- DriveDreamer4D: World Models Are Effective Data Machines for 4D Driving Scene Representation
- ClimbingCap: Multi-Modal Dataset and Method for Rock Climbing in World Coordinate
- Synthetic Prior for Few-Shot Drivable Head Avatar Inversion
- Olympus: A Universal Task Router for Computer Vision Tasks
- Identity-Preserving Text-to-Video Generation by Frequency Decomposition
- ILIAS: Instance-Level Image retrieval At Scale
- VI^3NR: Variance Informed Initialization for Implicit Neural Representations
- PhysicsGen: Can Generative Models Learn from Images to Predict Complex Physical Relations?
- NeRFPrior: Learning Neural Radiance Field as a Prior for Indoor Scene Reconstruction
- HyperPose: Hypernetwork-Infused Camera Pose Localization and an Extended Cambridge Landmarks Dataset
- HyperLoRA: Parameter-Efficient Adaptive Generation for Portrait Synthesis
- FrugalNeRF: Fast Convergence for Extreme Few-shot Novel View Synthesis without Learned Priors
- Parametric Point Cloud Completion for Polygonal Surface Reconstruction
- Classifier-to-Bias: Toward Unsupervised Automatic Bias Detection for Visual Classifiers
- FSboard: Over 3 Million Characters of ASL Fingerspelling Collected via Smartphones
- Insightful Instance Features for 3D Instance Segmentation
- WF-VAE: Enhancing Video VAE by Wavelet-Driven Energy Flow for Latent Video Diffusion Model
- A Bias-Free Training Paradigm for More General AI-generated Image Detection
- R2C: Mapping Room to Chessboard to Unlock LLM As Low-Level Action Planner
- Apply Hierarchical-Chain-of-Generation to Complex Attributes Text-to-3D Generation
- Cheb-GR: Rethinking K-nearest Neighbor Search in Re-ranking for Person Re-identification
- Multi-Modal Contrastive Masked Autoencoders: A Two-Stage Progressive Pre-training Approach for RGBD Datasets
- MagicArticulate: Make Your 3D Models Articulation-Ready
- MDP: Multidimensional Vision Model Pruning with Latency Constraint
- Dora: Sampling and Benchmarking for 3D Shape Variational Auto-Encoders
- Mask^2DiT: Dual Mask-based Diffusion Transformer for Multi-Scene Long Video Generation
- PromptHash: Affinity-Prompted Collaborative Cross-Modal Learning for Adaptive Hashing Retrieval
- MExD: An Expert-Infused Diffusion Model for Whole-Slide Image Classification
- ObjectMover: Generative Object Movement with Video Prior
- Circumventing Shortcuts in Audio-visual Deepfake Detection Datasets with Unsupervised Learning
- AnyCam: Learning to Recover Camera Poses and Intrinsics from Casual Videos
- HandOS: 3D Hand Reconstruction in One Stage
- Do Computer Vision Foundation Models Learn the Low-level Characteristics of the Human Visual System?
- VoteFlow: Enforcing Local Rigidity in Self-Supervised Scene Flow
- NoiseCtrl: A Sampling-Algorithm-Agnostic Conditional Generation Method for Diffusion Models
- SPAR3D: Stable Point-Aware Reconstruction of 3D Objects from Single Images
- Auto Cherry-Picker: Learning from High-quality Generative Data Driven by Language
- Mesh Mamba: A Unified State Space Model for Saliency Prediction in Non-Textured and Textured Meshes
- VELOCITI: Benchmarking Video-Language Compositional Reasoning with Strict Entailment
- VideoHandles: Editing 3D Object Compositions in Videos Using Video Generative Priors
- MicroVQA: A Multimodal Reasoning Benchmark for Microscopy-Based Scientific Research
- UltraFusion: Ultra High Dynamic Imaging using Exposure Fusion
- Retrieving Semantics from the Deep: an RAG Solution for Gesture Synthesis
- Hybrid Concept Bottleneck Models
- Reconstruction vs. Generation: Taming Optimization Dilemma in Latent Diffusion Models
- PatchGuard: Adversarially Robust Anomaly Detection and Localization through Vision Transformers and Pseudo Anomalies
- Auto-Encoded Supervision for Perceptual Image Super-Resolution
- Dynamic Content Prediction with Motion-aware Priors for Blind Face Video Restoration
- FoundHand: Large-Scale Domain-Specific Learning for Controllable Hand Image Generation
- STING-BEE: Towards Vision-Language Model for Real-World X-ray Baggage Security Inspection
- BADGR: Bundle Adjustment Diffusion Conditioned by Gradients for Wide-Baseline Floor Plan Reconstruction
- Diffusion-based Realistic Listening Head Generation via Hybrid Motion Modeling
- Less is More: Efficient Image Vectorization with Adaptive Parameterization
- PS-Diffusion: Photorealistic Subject-Driven Image Editing with Disentangled Control and Attention
- Toward Real-world BEV Perception: Depth Uncertainty Estimation via Gaussian Splatting
- Reloc3r: Large-Scale Training of Relative Camera Pose Regression for Generalizable, Fast, and Accurate Visual Localization
- GauSTAR: Gaussian Surface Tracking and Reconstruction
- ReVisionLLM: Recursive Vision-Language Model for Temporal Grounding in Hour-Long Videos
- Mamba-Adaptor: State Space Model Adaptor for Visual Recognition
- Spatial Transport Optimization by Repositioning Attention Map for Training-Free Text-to-Image Synthesis
- Saliuitl: Ensemble Salience Guided Recovery of Adversarial Patches against CNNs
- Paint by Inpaint: Learning to Add Image Objects by Removing Them First
- Plug-and-Play Versatile Compressed Video Enhancement
- 3D Occupancy Prediction with Low-Resolution Queries via Prototype-aware View Transformation
- Task Singular Vectors: Reducing Task Interference in Model Merging
- Crab: A Unified Audio-Visual Scene Understanding Model with Explicit Cooperation
- Localized Concept Erasure for Text-to-Image Diffusion Models Using Training-Free Gated Low-Rank Adaptation
- Generative Video Propagation
- PERSE: Personalized 3D Generative Avatars from A Single Portrait
- Any-Resolution AI-Generated Image Detection by Spectral Learning
- SAT-HMR: Real-Time Multi-Person 3D Mesh Estimation via Scale-Adaptive Tokens
- Towards Zero-Shot Anomaly Detection and Reasoning with Multimodal Large Language Models
- Leveraging SD Map to Augment HD Map-based Trajectory Prediction
- Re-HOLD: Video Hand Object Interaction Reenactment via adaptive Layout-instructed Diffusion Model
- Sim-to-Real Causal Transfer: A Metric Learning Approach to Causally-Aware Interaction Representations
- LongDiff: Training-Free Long Video Generation in One Go
- SMTPD: A New Benchmark for Temporal Prediction of Social Media Popularity
- Relative Pose Estimation through Affine Corrections of Monocular Depth Priors
- Articulated Kinematics Distillation from Video Diffusion Models
- SKE-Layout: Spatial Knowledge Enhanced Layout Generation with LLMs
- ONDA-Pose: Occlusion-Aware Neural Domain Adaptation for Self-Supervised 6D Object Pose Estimation
- RoadSocial: A Diverse VideoQA Dataset and Benchmark for Road Event Understanding from Social Video Narratives
- vesselFM: A Foundation Model for Universal 3D Blood Vessel Segmentation
- VisionZip: Longer is Better but Not Necessary in Vision Language Models
- ZeroGrasp: Zero-Shot Shape Reconstruction Enabled Robotic Grasping
- Video-ColBERT: Contextualized Late Interaction for Text-to-Video Retrieval
- ActiveGAMER: Active GAussian Mapping through Efficient Rendering
- Scenario Dreamer: Vectorized Latent Diffusion for Generating Driving Simulation Environments
- COBRA: COmBinatorial Retrieval Augmentation for Few-Shot Adaptation
- Consistent Normal Orientation for 3D Point Clouds via Least Squares on Delaunay Graph
- Gaussian Eigen Models for Human Heads
- Sketchtopia: A Dataset and Foundational Agents for Benchmarking Asynchronous Multimodal Communication with Iconic Feedback
- ImViD: Immersive Volumetric Videos for Enhanced VR Engagement
- UniRestore: Unified Perceptual and Task-Oriented Image Restoration Model Using Diffusion Prior
- Dense Dispersed Structured Light for Hyperspectral 3D Imaging of Dynamic Scenes
- Sea-ing in Low-light
- MM-OR: A Large Multimodal Operating Room Dataset for Semantic Understanding of High-Intensity Surgical Environments
- O-TPT: Orthogonality Constraints for Calibrating Test-time Prompt Tuning in Vision-Language Models
- Self-Supervised Cross-View Correspondence with Predictive Cycle Consistency
- Towards Scalable Human-aligned Benchmark for Text-guided Image Editing
- Chapter-Llama: Efficient Chaptering in Hour-Long Videos with LLMs
- GigaHands: A Massive Annotated Dataset of Bimanual Hand Activities
- Believing is Seeing: Unobserved Object Detection using Generative Models
- SkillMimic: Learning Basketball Interaction Skills from Demonstrations
- Video-Bench: Human-Aligned Video Generation Benchmark
- SF3D: Stable Fast 3D Mesh Reconstruction with UV-unwrapping and Illumination Disentanglement
- VideoGLaMM: A Large Multimodal Model for Pixel-Level Visual Grounding in Videos
- SLAM3R: Real-Time Dense Scene Reconstruction from Monocular RGB Videos
- DepthCues: Evaluating Monocular Depth Perception in Large Vision Models
- How to Merge Your Multimodal Models Over Time?
- Correcting Deviations from Normality: A Reformulated Diffusion Model for Multi-Class Unsupervised Anomaly Detection
- One-shot 3D Object Canonicalization based on Geometric and Semantic Consistency
- A Flag Decomposition for Hierarchical Datasets
- Occlusion-aware Text-Image-Point Cloud Pretraining for Open-World 3D Object Recognition
- Zero-1-to-A: Zero-Shot One Image to Animatable Head Avatars Using Video Diffusion
- ProHOC: Probabilistic Hierarchical Out-of-Distribution Classification via Multi-Depth Networks
- Creating Your Editable 3D Photorealistic Avatar with Tetrahedron-constrained Gaussian Splatting
- LongVALE: Vision-Audio-Language-Event Benchmark Towards Time-Aware Omni-Modal Perception of Long Videos
- 3DEnhancer: Consistent Multi-View Diffusion for 3D Enhancement
- Detecting Open World Objects via Partial Attribute Assignment
- VidMuse: A Simple Video-to-Music Generation Framework with Long-Short-Term Modeling
- Structure from Collision
- All Languages Matter: Evaluating LMMs on Culturally Diverse 100 Languages
- Free360: Layered Gaussian Splatting for Unbounded 360-Degree View Synthesis from Extremely Sparse and Unposed Views
- Scaling Down Text Encoders of Text-to-Image Diffusion Models
- Through-The-Mask: Mask-based Motion Trajectories for Image-to-Video Generation
- Shape Abstraction via Marching Differentiable Support Functions
- AuraFusion360: Augmented Unseen Region Alignment for Reference-based 360° Unbounded Scene Inpainting
- Visual Agentic AI for Spatial Reasoning with a Dynamic API
- Learning Occlusion-Robust Vision Transformers for Real-Time UAV Tracking
- Optimizing for the Shortest Path in Denoising Diffusion Model
- Attention Distillation: A Unified Approach to Visual Characteristics Transfer
- RainyGS: Efficient Rain Synthesis with Physically-Based Gaussian Splatting
- GPVK-VL: Geometry-Preserving Virtual Keyframes for Visual Localization under Large Viewpoint Changes
- GuardSplat: Efficient and Robust Watermarking for 3D Gaussian Splatting
- Gen3DEval: Using vLLMs for Automatic Evaluation of Generated 3D Objects
- VideoScene: Distilling Video Diffusion Model to Generate 3D Scenes in One Step
- Exploring Contextual Attribute Density in Referring Expression Counting
- BIOMEDICA: An Open Biomedical Image-Caption Archive, Dataset, and Vision-Language Models Derived from Scientific Literature
- BIGS: Bimanual Category-agnostic Interaction Reconstruction from Monocular Videos via 3D Gaussian Splatting
- Unleashing In-context Learning of Autoregressive Models for Few-shot Image Manipulation
- ZeroVO: Visual Odometry with Minimal Assumptions
- GazeGene: Large-scale Synthetic Gaze Dataset with 3D Eyeball Annotations
- BG-Triangle: Bézier Gaussian Triangle for 3D Vectorization and Rendering
- Generalized Few-shot 3D Point Cloud Segmentation with Vision-Language Model
- Video Summarization with Large Language Models
- Towards Human-Understandable Multi-Dimensional Concept Discovery
- Pos3R: 6D Pose Estimation for Unseen Objects Made Easy
- WildAvatar: Learning In-the-wild 3D Avatars from the Web
- SemGeoMo: Dynamic Contextual Human Motion Generation with Semantic and Geometric Guidance
- LC-Mamba: Local and Continuous Mamba with Shifted Windows for Frame Interpolation
- Focus-N-Fix: Region-Aware Fine-Tuning for Text-to-Image Generation
- SSHNet: Unsupervised Cross-modal Homography Estimation via Problem Reformulation and Split Optimization
- Cross-modal Information Flow in Multimodal Large Language Models
- CDI: Copyrighted Data Identification in Diffusion Models
- SyncSDE: A Probabilistic Framework for Diffusion Synchronization
- SharpDepth: Sharpening Metric Depth Predictions Using Diffusion Distillation
- Towards Precise Scaling Laws for Video Diffusion Transformers
- MASt3R-SLAM: Real-Time Dense SLAM with 3D Reconstruction Priors
- FreeCloth: Free-form Generation Enhances Challenging Clothed Human Modeling
- The Impact Label Noise and Choice of Threshold has on Cross-Entropy and Soft-Dice in Image Segmentation
- GRAPHGPT-O: Synergistic Multimodal Comprehension and Generation on Graphs
- MetaWriter: Personalized Handwritten Text Recognition Using Meta-Learned Prompt Tuning
- A Semantic Knowledge Complementarity based Decoupling Framework for Semi-supervised Class-imbalanced Medical Image Segmentation
- MP-SfM: Monocular Surface Priors for Robust Structure-from-Motion
- Factored-NeuS: Reconstructing Surfaces, Illumination, and Materials of Possibly Glossy Objects
- DiverseFlow: Sample-Efficient Diverse Mode Coverage in Flows
- HD-EPIC: A Highly-Detailed Egocentric Video Dataset
- Inversion Circle Interpolation: Diffusion-based Image Augmentation for Data-scarce Classification
- SpectroMotion: Dynamic 3D Reconstruction of Specular Scenes
- LoRA Recycle: Unlocking Tuning-Free Few-Shot Adaptability in Visual Foundation Models by Recycling Pre-Tuned LoRAs
- HuPerFlow: A Comprehensive Benchmark for Human vs. Machine Motion Estimation Comparison
- MIMO: Controllable Character Video Synthesis with Spatial Decomposed Modeling
- Volumetric Surfaces: Representing Fuzzy Geometries with Layered Meshes
- DTGBrepGen: A Novel B-rep Generative Model through Decoupling Topology and Geometry
- InteractVLM: 3D Interaction Reasoning from 2D Foundational Models
- Finsler Multi-Dimensional Scaling: Manifold Learning for Asymmetric Dimensionality Reduction and Embedding
- Not Only Text: Exploring Compositionality of Visual Representations in Vision-Language Models
- Unsupervised Discovery of Facial Landmarks and Head Pose
- Decoupling Training-Free Guided Diffusion by ADMM
- Joint Out-of-Distribution Filtering and Data Discovery Active Learning
- High-fidelity 3D Object Generation from Single Image with RGBN-Volume Gaussian Reconstruction Model
- HOIGen-1M: A Large-scale Dataset for Human-Object Interaction Video Generation
- Preserve or Modify? Context-Aware Evaluation for Balancing Preservation and Modification in Text-Guided Image Editing
- IM-Portrait: Learning 3D-aware Video Diffusion for Photorealistic Talking Heads from Monocular Videos
- Unveiling the Mist over 3D Vision-Language Understanding: Object-centric Evaluation with Chain-of-Analysis
- Directional Label Diffusion Model for Learning from Noisy Labels
- Improving Transferable Targeted Attacks with Feature Tuning Mixup
- Attribute-Missing Multi-view Graph Clustering
- A Unified Model for Compressed Sensing MRI Across Undersampling Patterns
- DynaMoDe-NeRF: Motion-aware Deblurring Neural Radiance Field for Dynamic Scenes
- It’s a (Blind) Match! Towards Vision-Language Correspondence without Parallel Data
- GeoAvatar: Geometrically-Consistent Multi-Person Avatar Reconstruction from Sparse Multi-View Videos
- Splatter-360: Generalizable 360 Gaussian Splatting for Wide-baseline Panoramic Images
- Dynamic Neural Surfaces for Elastic 4D Shape Representation and Analysis
- RICCARDO: Radar Hit Prediction and Convolution for Camera-Radar 3D Object Detection
- GliaNet: Adaptive Neural Network Structure Learning with Glia-Driven
- GenAssets: Generating in-the-wild 3D Assets in Latent Space
- HomoGen: Enhanced Video Inpainting via Homography Propagation and Diffusion
- RelationField: Relate Anything in Radiance Fields
- Time of the Flight of the Gaussians: Optimizing Depth Indirectly in Dynamic Radiance Fields
- 4D LangSplat: 4D Language Gaussian Splatting via Multimodal Large Language Models
- Bringing CLIP to the Clinic: Dynamic Soft Labels and Negation-Aware Learning for Medical Analysis
- FLARE: Feed-forward Geometry, Appearance and Camera Estimation from Uncalibrated Sparse Views
- HybridMQA: Exploring Geometry-Texture Interactions for Colored Mesh Quality Assessment
- Multi-Resolution Pathology-Language Pre-training Model with Text-Guided Visual Representation
- ShiftwiseConv: Small Convolutional Kernel with Large Kernel Effect
- SUM Parts: Benchmarking Part-Level Semantic Segmentation of Urban Meshes
- Perceptually Accurate 3D Talking Head Generation: New Definitions, Speech-Mesh Representation, and Evaluation Metrics
- Parameterized Blur Kernel Prior Learning for Local Motion Deblurring
- Dynamic Motion Blending for Versatile Motion Editing
- StableAnimator: High-Quality Identity-Preserving Human Image Animation
- Speedy-Splat: Fast 3D Gaussian Splatting with Sparse Pixels and Sparse Primitives
- Category-Agnostic Neural Object Rigging
- Mitigating the Human-Robot Domain Discrepancy in Visual Pre-training for Robotic Manipulation
- AnyDressing: Customizable Multi-Garment Virtual Dressing via Latent Diffusion Models
- Floxels: Fast Unsupervised Voxel Based Scene Flow Estimation
- Understanding Multi-layered Transmission Matrices
- Attraction Diminishing and Distributing for Few-Shot Class-Incremental Learning
- Protecting Your Video Content: Disrupting Automated Video-based LLM Annotations
- GEM: A Generalizable Ego-Vision Multimodal World Model for Fine-Grained Ego-Motion, Object Dynamics, and Scene Composition Control
- Meta-Learning Hyperparameters for Parameter Efficient Fine-Tuning
- Boosting the Dual-Stream Architecture in Ultra-High Resolution Segmentation with Resolution-Biased Uncertainty Estimation
- DOF-GS: Adjustable Depth-of-Field 3D Gaussian Splatting for Post-Capture Refocusing, Defocus Rendering and Blur Removal
- HyperNVD: Accelerating Neural Video Decomposition via Hypernetworks
- Good, Cheap, and Fast: Overfitted Image Compression with Wasserstein Distortion
- Spectral State Space Model for Rotation-Invariant Visual Representation Learning
- Towards Enhanced Image Inpainting: Mitigating Unwanted Object Insertion and Preserving Color Consistency
- MonoInstance: Enhancing Monocular Priors via Multi-view Instance Alignment for Neural Rendering and Reconstruction
- DINOv2 Meets Text: A Unified Framework for Image- and Pixel-Level Vision-Language Alignment
- Advancing Generalizable Tumor Segmentation with Anomaly-Aware Open-Vocabulary Attention Maps and Frozen Foundation Diffusion Models
- Can Text-to-Video Generation help Video-Language Alignment?
- Scene-Centric Unsupervised Panoptic Segmentation
- GraphI2P: Image-to-Point Cloud Registration with Exploring Pattern of Correspondence via Graph Learning
- 3D Student Splatting and Scooping
- Continuous Locomotive Crowd Behavior Generation
- DexGrasp Anything: Towards Universal Robotic Dexterous Grasping with Physics Awareness
- 3D Convex Splatting: Radiance Field Rendering with 3D Smooth Convexes
- Data Distributional Properties As Inductive Bias for Systematic Generalization
- Beyond Sight: Towards Cognitive Alignment in LVLM via Enriched Visual Knowledge
- Closest Neighbors are Harmful for Lightweight Masked Auto-encoders
- Exploiting Temporal State Space Sharing for Video Semantic Segmentation
- PRaDA: Projective Radial Distortion Averaging
- A Simple Data Augmentation for Feature Distribution Skewed Federated Learning
- Functionality Understanding and Segmentation in 3D Scenes
- Potential Field Based Deep Metric Learning
- Arbitrary-steps Image Super-resolution via Diffusion Inversion
- Improving Adversarial Transferability on Vision Transformers via Forward Propagation Refinement
- Ego4o: Egocentric Human Motion Capture and Understanding from Multi-Modal Input
- BLADE: Single-view Body Mesh Estimation through Accurate Depth Estimation
- Text Augmented Correlation Transformer For Few-shot Classification & Segmentation
- Prof. Robot: Differentiable Robot Rendering Without Static and Self-Collisions
- Enhancing Vision-Language Compositional Understanding with Multimodal Synthetic Data
- SketchVideo: Sketch-based Video Generation and Editing
- F-LMM: Grounding Frozen Large Multimodal Models
- Pursuing Temporal-Consistent Video Virtual Try-On via Dynamic Pose Interaction
- TopoCellGen: Generating Histopathology Cell Topology with a Diffusion Model
- Unified Uncertainty-Aware Diffusion for Multi-Agent Trajectory Modeling
- Embodied Scene Understanding for Vision Language Models via MetaVQA
- Robust 3D Shape Reconstruction in Zero-Shot from a Single Image in the Wild
- Cross-Modal Distillation for 2D/3D Multi-Object Discovery from 2D Motion
- Cross-modal Causal Relation Alignment for Video Question Grounding
- Omni-Scene: Omni-Gaussian Representation for Ego-Centric Sparse-View Scene Reconstruction
- RoboGround: Robotic Manipulation with Grounded Vision-Language Priors
- SwiftEdit: Lightning Fast Text-Guided Image Editing via One-Step Diffusion
- Noise Diffusion for Enhancing Semantic Faithfulness in Text-to-Image Synthesis
- Learning from Synchronization: Self-Supervised Uncalibrated Multi-View Person Association in Challenging Scenes
- UniAP: Unifying Inter- and Intra-Layer Automatic Parallelism by Mixed Integer Quadratic Programming
- PGC: Physics-Based Gaussian Cloth from a Single Pose
- Learning Temporally Consistent Video Depth from Video Diffusion Priors
- Low-Rank Adaptation in Multilinear Operator Networks for Security-Preserving Incremental Learning
- Change3D: Revisiting Change Detection and Captioning from A Video Modeling Perspective
- Hyperdimensional Uncertainty Quantification for Multimodal Uncertainty Fusion in Autonomous Vehicles Perception
- Language-Assisted Debiasing and Smoothing for Foundation Model-Based Semi-Supervised Learning
- Let Humanoids Hike! Integrative Skill Development on Complex Trails
- Towards More General Video-based Deepfake Detection through Facial Component Guided Adaptation for Foundation Model
- PersonaBooth: Personalized Text-to-Motion Generation
- DropGaussian: Structural Regularization for Sparse-view Gaussian Splatting
- Using Diffusion Priors for Video Amodal Segmentation
- Consistency Posterior Sampling for Diverse Image Synthesis
- RoboTwin: Dual-Arm Robot Benchmark with Generative Digital Twins
- Satellite Observations Guided Diffusion Model for Accurate Meteorological States at Arbitrary Resolution
- HRAvatar: High-Quality and Relightable Gaussian Head Avatar
- DeepCompress-ViT: Rethinking Model Compression to Enhance Efficiency of Vision Transformers at the Edge
- DNF: Unconditional 4D Generation with Dictionary-based Neural Fields
- VidBot: Learning Generalizable 3D Actions from In-the-Wild 2D Human Videos for Zero-Shot Robotic Manipulation
- APT: Adaptive Personalized Training for Diffusion Models with Limited Data
- FlipSketch: Flipping Static Drawings to Text-Guided Sketch Animations
- TreeMeshGPT: Artistic Mesh Generation with Autoregressive Tree Sequencing
- NexusGS: Sparse View Synthesis with Epipolar Depth Priors in 3D Gaussian Splatting
- 4DTAM: Non-Rigid Tracking and Mapping via Dynamic Surface Gaussians
- Material Anything: Generating Materials for Any 3D Object via Diffusion
- From Sparse to Dense: Camera Relocalization with Scene-Specific Detector from Feature Gaussian Splatting
- SIDA: Social Media Image Deepfake Detection, Localization and Explanation with Large Multimodal Model
- Towards Realistic Example-based Modeling via 3D Gaussian Stitching
- MotionPRO: Exploring the Role of Pressure in Human MoCap and Beyond
- LesionLocator: Zero-Shot Universal Tumor Segmentation and Tracking in 3D Whole-Body Imaging
- BIMBA: Selective-Scan Compression for Long-Range Video Question Answering
- DV-Matcher: Deformation-based Non-rigid Point Cloud Matching Guided by Pre-trained Visual Features
- DyCON: Dynamic Uncertainty-aware Consistency and Contrastive Learning for Semi-supervised Medical Image Segmentation
- MP-GUI: Modality Perception with MLLMs for GUI Understanding
- Bias for Action: Video Implicit Neural Representations with Bias Modulation
- Homogeneous Dynamics Space for Heterogeneous Humans
- Doppelgangers++: Improved Visual Disambiguation with Geometric 3D Features
- Neural LightRig: Unlocking Accurate Object Normal and Material Estimation with Multi-Light Diffusion
- h-Edit: Effective and Flexible Diffusion-Based Editing via Doob's h-Transform
- Generalized Recorrupted-to-Recorrupted: Self-Supervised Learning Beyond Gaussian Noise
- Adversarial Diffusion Compression for Real-World Image Super-Resolution
- Probability Density Geodesics in Image Diffusion Latent Space
- RNG: Relightable Neural Gaussians
- Instruct-CLIP: Improving Instruction-Guided Image Editing with Automated Data Refinement Using Contrastive Learning
- Temporal Action Detection Model Compression by Progressive Block Drop
- HistoFS: Non-IID Histopathologic Whole Slide Image Classification via Federated Style Transfer with RoI-Preserving
- GCE-Pose: Global Context Enhancement for Category-level Object Pose Estimation
- Hybrid Global-Local Representation with Augmented Spatial Guidance for Zero-Shot Referring Image Segmentation
- StageDesigner: Artistic Stage Generation for Scenography via Theater Scripts
- Vision-Language Models Do Not Understand Negation
- PCDreamer: Point Cloud Completion Through Multi-view Diffusion Priors
- V2V3D: View-to-View Denoised 3D Reconstruction for Light Field Microscopy
- Birth and Death of a Rose
- DoF-Gaussian: Controllable Depth-of-Field for 3D Gaussian Splatting
- Depth Any Camera: Zero-Shot Metric Depth Estimation from Any Camera
- Latent Space Imaging
- Unlocking the Potential of Unlabeled Data in Semi-Supervised Domain Generalization
- 3DTopia-XL: Scaling High-quality 3D Asset Generation via Primitive Diffusion
- Rethinking Few-Shot Adaptation of Vision-Language Models in Two Stages
- Open Set Label Shift with Test Time Out-of-Distribution Reference
- BimArt: A Unified Approach for the Synthesis of 3D Bimanual Interaction with Articulated Objects
- GLASS: Guided Latent Slot Diffusion for Object-Centric Learning
- Prompt2Perturb (P2P): Text-Guided Diffusion-Based Adversarial Attack on Breast Ultrasound Images
- PICD: Versatile Perceptual Image Compression with Diffusion Rendering
- Task-Specific Gradient Adaptation for Few-Shot One-Class Classification
- Rethinking Decoder Design: Improving Biomarker Segmentation Using Depth-to-Space Restoration and Residual Linear Attention
- HalLoc: Token-level Localization of Hallucinations for Vision Language Models
- HyperGLM: HyperGraph for Video Scene Graph Generation and Anticipation
- DiffPortrait360: Consistent Portrait Diffusion for 360 View Synthesis
- ArticulatedGS: Self-supervised Digital Twin Modeling of Articulated Objects using 3D Gaussian Splatting
- Subnet-Aware Dynamic Supernet Training for Neural Architecture Search
- LiveCC: Learning Video LLM with Streaming Speech Transcription at Scale
- Playing the Fool: Jailbreaking LLMs and Multimodal LLMs with Out-of-Distribution Strategy
- Distilling Long-tailed Datasets
- Unsupervised Foundation Model-Agnostic Slide-Level Representation Learning
- SplineGS: Robust Motion-Adaptive Spline for Real-Time Dynamic 3D Gaussians from Monocular Video
- CL-LoRA: Continual Low-Rank Adaptation for Rehearsal-Free Class-Incremental Learning
- Disco4D: Disentangled 4D Human Generation and Animation from a Single Image
- HoGS: Unified Near and Far Object Reconstruction via Homogeneous Gaussian Splatting
- Exploring the Deep Fusion of Large Language Models and Diffusion Transformers for Text-to-Image Synthesis
- Enhancing Creative Generation on Stable Diffusion-based Models
- DIO: Decomposable Implicit 4D Occupancy-Flow World Model
- 3D Gaussian Inpainting with Depth-Guided Cross-View Consistency
- GLane3D: Detecting Lanes with Graph of 3D Keypoints
- Deep Change Monitoring: A Hyperbolic Representative Learning Framework and a Dataset for Long-term Fine-grained Tree Change Detection
- Learning Endogenous Attention for Incremental Object Detection
- CoMapGS: Covisibility Map-based Gaussian Splatting for Sparse Novel View Synthesis
- Blurred LiDAR for Sharper 3D: Robust Handheld 3D Scanning with Diffuse LiDAR and RGB
- Real-time High-fidelity Gaussian Human Avatars with Position-based Interpolation of Spatially Distributed MLPs
- Precise, Fast, and Low-cost Concept Erasure in Value Space: Orthogonal Complement Matters
- ETAP: Event-based Tracking of Any Point
- Segment Anything, Even Occluded
- ReSpec: Relevance and Specificity Grounded Online Filtering for Learning on Video-Text Data Streams
- Which Viewpoint Shows it Best? Language for Weakly Supervising View Selection in Multi-view Instructional Videos
- 3D-SLNR: A Super Lightweight Neural Representation for Large-scale 3D Mapping
- DAMM-Diffusion: Learning Divergence-Aware Multi-Modal Diffusion Model for Nanoparticles Distribution Prediction
- Dyn-HaMR: Recovering 4D Interacting Hand Motion from a Dynamic Camera
- Harnessing Frozen Unimodal Encoders for Flexible Multimodal Alignment
- SPARS3R: Semantic Prior Alignment and Regularization for Sparse 3D Reconstruction
- Automated Generation of Challenging Multiple-Choice Questions for Vision Language Model Evaluation
- Interactive Medical Image Analysis with Concept-based Similarity Reasoning
- Exploration-Driven Generative Interactive Environments
- PQPP: A Joint Benchmark for Text-to-Image Prompt and Query Performance Prediction
- GenVDM: Generating Vector Displacement Maps From a Single Image
- MMAudio: Taming Multimodal Joint Training for High-Quality Video-to-Audio Synthesis
- Sketch Down the FLOPs: Towards Efficient Networks for Human Sketch
- FactCheXcker: Mitigating Measurement Hallucinations in Chest X-ray Report Generation Models
- DiffSensei: Bridging Multi-Modal LLMs and Diffusion Models for Customized Manga Generation
- Semi-Supervised State-Space Model with Dynamic Stacking Filter for Real-World Video Deraining
- BASKET: A Large-Scale Video Dataset for Fine-Grained Skill Estimation
- Scene-agnostic Pose Regression for Visual Localization
- BF-STVSR: B-Splines and Fourier---Best Friends for High Fidelity Spatial-Temporal Video Super-Resolution
- Geometry in Style: 3D Stylization via Surface Normal Deformation
- Nearly Zero-Cost Protection Against Mimicry by Personalized Diffusion Models
- SAR3D: Autoregressive 3D Object Generation and Understanding via Multi-scale 3D VQVAE
- Octopus: Alleviating Hallucination via Dynamic Contrastive Decoding
- 3DGUT: Enabling Distorted Cameras and Secondary Rays in Gaussian Splatting
- ROLL: Robust Noisy Pseudo-label Learning for Multi-View Clustering with Noisy Correspondence
- Learning Physics-Based Full-Body Human Reaching and Grasping from Brief Walking References
Tutorials
- Foundations of Interpretable AI
- The 2nd Point Cloud Tutorial: All You Need To Know About 3D Point Cloud
- Scalable Generative Models in Computer Vision
- From Video Generation to World Model
- Volumetric Video in the Real World
- Cognitive AI for the Future: Agentic Multimodal Models and RAG for Vision Language Applications, from Training to Deployment
- Evaluating Large Multi-modal Models: Challenges and Methods
- Multi-Modal Computer Vision and Foundation Models In Agriculture in conjunction with IEEE CVPR 2025
- Robotics 101: An Odyssey from A Vision Perspective
- Animal re-identification
- Computer Vision over Homomorphically Encrypted Data
- Continuous Data Cycle via Foundation Models
- Edge AI in Action: Technologies and Applications
- Identifying Structure in Data: All you need to know about Dimensionality Reduction, Clustering and more
- Multimodal Mathematical Reasoning: Frontiers in Integrating Vision, Language, and Symbolic Representations
- Full-Stack, GPU-based Acceleration of Deep Learning and Foundation Models
- Power-efficient neural networks using low-precision data types and quantization
- Intelligent Healthcare based on Cameras and Wireless Sensors
- Recent Advances in Vision Foundation Models