CVPR
2025
Papers
Memories of Forgotten Concepts
Seeing More with Less: Human-like Representations in Vision Models
Erasing Undesirable Influence in Diffusion Models
Random Conditioning with Distillation for Data-Efficient Diffusion Model Compression
A Unified, Resilient, and Explainable Adversarial Patch Detector
StarVector: Generating Scalable Vector Graphics Code from Images and Text
Toward Robust Neural Reconstruction from Sparse Point Sets
UniPhy: Learning a Unified Constitutive Model for Inverse Physics Simulation
DiffusionSfM: Predicting Structure and Motion via Ray Origin and Endpoint Diffusion
Unseen Visual Anomaly Generation
Enhancing Adversarial Transferability with Checkpoints of a Single Model’s Training
Unbiasing through Textual Descriptions: Mitigating Representation Bias in Video Benchmarks
Enhanced OoD Detection through Cross-Modal Alignment of Multi-Modal Representations
APT: Adaptive Personalized Training for Diffusion Models with Limited Data
Gyro-based Neural Single Image Deblurring
Flexible Frame Selection for Efficient Video Reasoning
FloVD: Optical Flow Meets Video Diffusion Model for Enhanced Camera-Controlled Video Synthesis
HyperNet Fields: Efficiently Training Hypernetworks without Ground Truth by Learning Weight Trajectories
HSI: A Holistic Style Injector for Arbitrary Style Transfer
Efficient Long Video Tokenization via Coordinate-based Patch Reconstruction
Accurate Scene Text Recognition with Efficient Model Scaling and Cloze Self-Distillation
ARKit LabelMaker: A New Scale for Indoor 3D Scene Understanding
Reconstructing Animals and the Wild
Task Singular Vectors: Reducing Task Interference in Model Merging
Preconditioners for the Stochastic Training of Neural Fields
Navigating Image Restoration with VAR’s Distribution Alignment Prior
A General Adaptive Dual-level Weighting Mechanism for Remote Sensing Pansharpening
CryptoFace: End-to-End Encrypted Face Recognition
OralXrays-9: Towards Hospital-Scale Panoramic X-ray Anomaly Detection via Personalized Multi-Object Query-Aware Mining
CheXwhatsApp: A Dataset for Exploring Challenges in the Diagnosis of Chest X-rays through Mobile Devices
VerbDiff: Text-Only Diffusion Models with Enhanced Interaction Awareness
DH-Set: Improving Vision-Language Alignment with Diverse and Hybrid Set-Embeddings Learning
MARVEL-40M+: Multi-Level Visual Elaboration for High-Fidelity Text-to-3D Content Creation
Controllable Human Image Generation with Personalized Multi-Garments
What’s in the Image? A Deep-Dive into the Vision of Vision Language Models
VLOGGER: Multimodal Diffusion for Embodied Avatar Synthesis
Parameter Efficient Mamba Tuning via Projector-targeted Diagonal-centric Linear Transformation
Text Augmented Correlation Transformer For Few-shot Classification & Segmentation
Just Dance with pi! A Poly-modal Inductor for Weakly-supervised Video Anomaly Detection
Pathways on the Image Manifold: Image Editing via Video Generation
Paint by Inpaint: Learning to Add Image Objects by Removing Them First
MVDoppler-Pose: Multi-Modal Multi-View mmWave Sensing for Long-Distance Self-Occluded Human Walking Pose Estimation
Towards Generalizable Scene Change Detection
CLIP Under the Microscope: A Fine-Grained Analysis of Multi-Object Representation
Few-shot Personalized Scanpath Prediction
Neuron: Learning Context-Aware Evolving Representations for Zero-Shot Skeleton Action Recognition
Stop Learning it all to Mitigate Visual Hallucination, Focus on the Hallucination Target
Inst3D-LMM: Instance-Aware 3D Scene Understanding with Multi-modal Instruction Tuning
DepthSplat: Connecting Gaussian Splatting and Depth
DeDe: Detecting Backdoor Samples for SSL Encoders via Decoders
PACT: Pruning and Clustering-Based Token Reduction for Faster Visual Language Models
MASH-VLM: Mitigating Action-Scene Hallucination in Video-LLMs through Disentangled Spatial-Temporal Representations
ReSpec: Relevance and Specificity Grounded Online Filtering for Learning on Video-Text Data Streams
T-FAKE: Synthesizing Thermal Images for Facial Landmarking
Let Humanoids Hike! Integrative Skill Development on Complex Trails
Opportunistic Single-Photon Time of Flight
Foundations of the Theory of Performance-Based Ranking
3D-GSW: 3D Gaussian Splatting for Robust Watermarking
Insightful Instance Features for 3D Instance Segmentation
TreeMeshGPT: Artistic Mesh Generation with Autoregressive Tree Sequencing
Neuro-Symbolic Evaluation of Text-to-Video Models using Formal Verification
Advancing Semantic Future Prediction through Multimodal Visual Sequence Transformers
GIF: Generative Inspiration for Face Recognition at Scale
Augmenting Perceptual Super-Resolution via Image Quality Predictors
AerialMegaDepth: Learning Aerial-Ground Reconstruction and View Synthesis
SPARC: Score Prompting and Adaptive Fusion for Zero-Shot Multi-Label Recognition in Vision-Language Models
Large-Scale Text-to-Image Model with Inpainting is a Zero-Shot Subject-Driven Image Generator
DiverseFlow: Sample-Efficient Diverse Mode Coverage in Flows
Data-Free Group-Wise Fully Quantized Winograd Convolution via Learnable Scales
Certified Human Trajectory Prediction
MotionMap: Representing Multimodality in Human Pose Forecasting
ViUniT: Visual Unit Tests for More Robust Visual Programming
Gradient Inversion Attacks on Parameter-Efficient Fine-Tuning
AG-VPReID: A Challenging Large-Scale Benchmark for Aerial-Ground Video-based Person Re-Identification
PyTorchGeoNodes: Enabling Differentiable Shape Programs for 3D Shape Reconstruction
VidBot: Learning Generalizable 3D Actions from In-the-Wild 2D Human Videos for Zero-Shot Robotic Manipulation
GLASS: Guided Latent Slot Diffusion for Object-Centric Learning
Scene-Centric Unsupervised Panoptic Segmentation
Floxels: Fast Unsupervised Voxel Based Scene Flow Estimation
PSA-SSL: Pose and Size-aware Self-Supervised Learning on LiDAR Point Clouds
MaDCoW: Marginal Distortion Correction for Wide-Angle Photography with Arbitrary Objects
VI^3NR: Variance Informed Initialization for Implicit Neural Representations
Pos3R: 6D Pose Estimation for Unseen Objects Made Easy
Large Self-Supervised Models Bridge the Gap in Domain Adaptive Object Detection
Spotting the Unexpected (STU): A 3D LiDAR Dataset for Anomaly Segmentation in Autonomous Driving
Two by Two: Learning Multi-Task Pairwise Objects Assembly for Generalizable Robot Manipulation
SelfSplat: Pose-Free and 3D Prior-Free Generalizable 3D Gaussian Splatting
Believing is Seeing: Unobserved Object Detection using Generative Models
Towards In-the-wild 3D Plane Reconstruction from a Single Image
Minority-Focused Text-to-Image Generation via Prompt Optimization
DrVideo: Document Retrieval Based Long Video Understanding
Polarized Color Screen Matting
Style-Editor: Text-driven Object-centric Style Editing
Cross-View Completion Models are Zero-shot Correspondence Estimators
Dense-SfM: Structure from Motion with Dense Consistent Matching
Co-op: Correspondence-based Novel Object Pose Estimation
Tripartite Weight-Space Ensemble for Few-Shot Class-Incremental Learning
ALIEN: Implicit Neural Representations for Human Motion Prediction under Arbitrary Latency
Identity-preserving Distillation Sampling by Fixed-Point Iterator
POp-GS: Next Best View in 3D-Gaussian Splatting with P-Optimality
Boosting Point-Supervised Temporal Action Localization through Integrating Query Reformation and Optimal Transport
Seurat: From Moving Points to Depth
Exploring Temporally-Aware Features for Point Tracking
Spatiotemporal Skip Guidance for Enhanced Video Diffusion Sampling
Annotation Ambiguity Aware Semi-Supervised Medical Image Segmentation
Perturb-and-Revise: Flexible 3D Editing with Generative Trajectories
SapiensID: Foundation for Human Recognition
Encapsulated Composition of Text-to-Image and Text-to-Video Models for High-Quality Video Synthesis
Light3R-SfM: Towards Feed-forward Structure-from-Motion
Category-Agnostic Neural Object Rigging
Test-time Augmentation Improves Efficiency in Conformal Prediction
Do ImageNet-trained Models Learn Shortcuts? The Impact of Frequency Shortcuts on Generalization
Self-Supervised Large Scale Point Cloud Completion for Archaeological Site Restoration
T-CIL: Temperature Scaling using Adversarial Perturbation for Calibration in Class-Incremental Learning
MoEE: Mixture of Emotion Experts for Audio-Driven Portrait Animation
Dynamic Pseudo Labeling via Gradient Cutting for High-Low Entropy Exploration
Localized Concept Erasure for Text-to-Image Diffusion Models Using Training-Free Gated Low-Rank Adaptation
NexusGS: Sparse View Synthesis with Epipolar Depth Priors in 3D Gaussian Splatting
Exploiting Temporal State Space Sharing for Video Semantic Segmentation
HUSH: Holistic Panoramic 3D Scene Understanding using Spherical Harmonics
UniRestore: Unified Perceptual and Task-Oriented Image Restoration Model Using Diffusion Prior
Multi-Granularity Class Prototype Topology Distillation for Class-Incremental Source-Free Unsupervised Domain Adaptation
VinaBench: Benchmark for Faithful and Consistent Visual Narratives
Textured Gaussians for Enhanced 3D Scene Appearance Modeling
Decentralized Diffusion Models
Arc2Avatar: Generating Expressive 3D Avatars from a Single Image via ID Guidance
Free on the Fly: Enhancing Flexibility in Test-Time Adaptation with Online EM
EDM: Equirectangular Projection-Oriented Dense Kernelized Feature Matching
Sufficient Invariant Learning for Distribution Shift
CheckManual: A New Challenge and Benchmark for Manual-based Appliance Manipulation
Explaining in Diffusion: Explaining a Classifier with Diffusion Semantics
LibraGrad: Balancing Gradient Flow for Universally Better Vision Transformer Attributions
VDocRAG: Retrieval-Augmented Generation over Visually-Rich Documents
GENMANIP: LLM-driven Simulation for Generalizable Instruction-Following Manipulation
MMAudio: Taming Multimodal Joint Training for High-Quality Video-to-Audio Synthesis
PS-EIP: Robust Photometric Stereo Based on Event Interval Profile
Structure from Collision
Adapter Merging with Centroid Prototype Mapping for Scalable Class-Incremental Learning
D^3-Human: Dynamic Disentangled Digital Human from Monocular Video
Fast3R: Towards 3D Reconstruction of 1000+ Images in One Forward Pass
PDFactor: Learning Tri-Perspective View Policy Diffusion Field for Multi-Task Robotic Manipulation
FlowRAM: Grounding Flow Matching Policy with Region-Aware Mamba Framework for Robotic Manipulation
FiRe: Fixed-points of Restoration Priors for Solving Inverse Problems
Generalized Zero-Shot Classification via Semantics-Free Inter-Class Feature Generation
SceneDiffuser++: City-Scale Traffic Simulation via a Generative World Model
Segment This Thing: Foveated Tokenization for Efficient Point-Prompted Segmentation
SOLVE: Synergy of Language-Vision and End-to-End Networks for Autonomous Driving
EvOcc: Accurate Semantic Occupancy for Automated Driving Using Evidence Theory
PhysAnimator: Physics-Guided Generative Cartoon Animation
TANGO: Training-free Embodied AI Agents for Open-world Tasks
Focal Split: Untethered Snapshot Depth from Differential Defocus
Task-Aware Clustering for Prompting Vision-Language Models
Forensic Self-Descriptions Are All You Need for Zero-Shot Detection, Open-Set Source Attribution, and Clustering of AI-generated Images
Evaluating Model Perception of Color Illusions in Photorealistic Scenes
CoCoGaussian: Leveraging Circle of Confusion for Gaussian Splatting from Defocused Images
CARL: A Framework for Equivariant Image Registration
Autoregressive Distillation of Diffusion Transformers
FlexiDiT: Your Diffusion Transformer Can Easily Generate High-Quality Samples with Less Compute
Fine-Grained Erasure in Text-to-Image Diffusion-based Foundation Models
FSboard: Over 3 Million Characters of ASL Fingerspelling Collected via Smartphones
Common3D: Self-Supervised Learning of 3D Morphable Models for Common Objects in Neural Feature Space
Thin-Shell-SfT: Fine-Grained Monocular Non-rigid 3D Surface Tracking with Neural Deformation Fields
LoTUS: Large-Scale Machine Unlearning with a Taste of Uncertainty
Pattern Analogies: Learning to Perform Programmatic Image Edits by Analogy
RUBIK: A Structured Benchmark for Image Matching across Geometric Challenges
Perceptually Accurate 3D Talking Head Generation: New Definitions, Speech-Mesh Representation, and Evaluation Metrics
ProbeSDF: Light Field Probes For Neural Surface Reconstruction
Finsler Multi-Dimensional Scaling: Manifold Learning for Asymmetric Dimensionality Reduction and Embedding
Graph Neural Network Combining Event Stream and Periodic Aggregation for Low-Latency Event-based Vision
A Theory of Learning Unified Model via Knowledge Integration from Label Space Varying Domains
The PanAf-FGBG Dataset: Understanding the Impact of Backgrounds in Wildlife Behaviour Recognition
CrossSDF: 3D Reconstruction of Thin Structures From Cross-Sections
Geometry-guided Online 3D Video Synthesis with Multi-View Temporal Consistency
Improving Diffusion Inverse Problem Solving with Decoupled Noise Annealing
Tuning the Frequencies: Robust Training for Sinusoidal Neural Networks
Flash-Split: 2D Reflection Removal with Flash Cues and Latent Diffusion Separation
POT: Prototypical Optimal Transport for Weakly Supervised Semantic Segmentation
Perceptual Inductive Bias Is What You Need Before Contrastive Learning
BOLT: Boost Large Vision-Language Model Without Training for Long-form Video Understanding
Pseudo Visible Feature Fine-Grained Fusion for Thermal Object Detection
KMD: Koopman Multi-modality Decomposition for Generalized Brain Tumor Segmentation under Incomplete Modalities
Enhancing Few-Shot Class-Incremental Learning via Training-Free Bi-Level Modality Calibration
Adaptive Keyframe Sampling for Long Video Understanding
Revisiting Backdoor Attacks against Large Vision-Language Models from Domain Shift
3D-LLaVA: Towards Generalist 3D LMMs with Omni Superpoint Transformer
Electromyography-Informed Facial Expression Reconstruction for Physiological-Based Synthesis and Analysis
CLOC: Contrastive Learning for Ordinal Classification with Multi-Margin N-pair Loss
Your Large Vision-Language Model Only Needs A Few Attention Heads For Visual Grounding
Unsupervised Foundation Model-Agnostic Slide-Level Representation Learning
Theory-Inspired Deep Multi-View Multi-Label Learning with Incomplete Views and Noisy Labels
From Elements to Design: A Layered Approach for Automatic Graphic Design Composition
Hyperspectral Pansharpening via Diffusion Models with Iteratively Zero-Shot Guidance
TinyFusion: Diffusion Transformers Learned Shallow
VinTAGe: Joint Video and Text Conditioning for Holistic Audio Generation
EditSplat: Multi-View Fusion and Attention-Guided Optimization for View-Consistent 3D Scene Editing with 3D Gaussian Splatting
Your ViT is Secretly an Image Segmentation Model
Using Diffusion Priors for Video Amodal Segmentation
Efficient Dynamic Scene Editing via 4D Gaussian-based Static-Dynamic Separation
Hyperdimensional Uncertainty Quantification for Multimodal Uncertainty Fusion in Autonomous Vehicles Perception
NSD-Imagery: A Benchmark Dataset for Extending fMRI Vision Decoding Methods to Mental Imagery
ShotAdapter: Text-to-Multi-Shot Video Generation with Diffusion Models
Hyperbolic Safety-Aware Vision-Language Models
GAF: Gaussian Avatar Reconstruction from Monocular Videos via Multi-view Diffusion
SoundVista: Novel-View Ambient Sound Synthesis via Visual-Acoustic Binding
Interpretable Generative Models through Post-hoc Concept Bottlenecks
ShowHowTo: Generating Scene-Conditioned Step-by-Step Visual Instructions
ESCAPE: Equivariant Shape Completion via Anchor Point Encoding
LoRACLR: Contrastive Adaptation for Customization of Diffusion Models
Test-Time Visual In-Context Tuning
RelationField: Relate Anything in Radiance Fields
Semantic Library Adaptation: LoRA Retrieval and Fusion for Open-Vocabulary Semantic Segmentation
Zero-Shot Image Restoration Using Few-Step Guidance of Consistency Models (and Beyond)
Efficient Fine-Tuning and Concept Suppression for Pruned Diffusion Models
Tra-MoE: Learning Trajectory Prediction Model from Multiple Domains for Adaptive Policy Conditioning
Towards Scalable Human-aligned Benchmark for Text-guided Image Editing
Gaussian Splatting Feature Fields for (Privacy-Preserving) Visual Localization
Video Depth without Video Models
Pose Priors from Language Models
Scaling Vision Pre-Training to 4K Resolution
CoLLM: A Large Language Model for Composed Image Retrieval
ITA-MDT: Image-Timestep-Adaptive Masked Diffusion Transformer Framework for Image-Based Virtual Try-On
MESC-3D: Mining Effective Semantic Cues for 3D Reconstruction from a Single Image
Hybrid Reciprocal Transformer with Triplet Feature Alignment for Scene Graph Generation
Poly-Autoregressive Prediction for Modeling Interactions
Multi-Group Proportional Representations for Text-to-Image Models
Quaffure: Real-Time Quasi-Static Neural Hair Simulation
PGC: Physics-Based Gaussian Cloth from a Single Pose
Towards Efficient Foundation Model for Zero-shot Amodal Segmentation
Advancing Manga Analysis: Comprehensive Segmentation Annotations for the Manga109 Dataset
One Diffusion to Generate Them All
EditAR: Unified Conditional Generation with Autoregressive Models
CAP4D: Creating Animatable 4D Portrait Avatars with Morphable Multi-View Diffusion Models
Causal Composition Diffusion Model for Closed-loop Traffic Generation
ExpertAF: Expert Actionable Feedback from Video
Rethinking Training for De-biasing Text-to-Image Generation: Unlocking the Potential of Stable Diffusion
ForestLPR: LiDAR Place Recognition in Forests Attentioning Multiple BEV Density Images
MET3R: Measuring Multi-View Consistency in Generated Images
MI-DETR: An Object Detection Model with Multi-time Inquiries Mechanism
CALICO: Part-Focused Semantic Co-Segmentation with Large Vision-Language Models
Training-free Neural Architecture Search through Variance of Knowledge of Deep Network Weights
LLAVIDAL: A Large LAnguage VIsion Model for Daily Activities of Living
ESC: Erasing Space Concept for Knowledge Deletion
Cross-Modal and Uncertainty-Aware Agglomeration for Open-Vocabulary 3D Scene Understanding
Noise Calibration and Spatial-Frequency Interactive Network for STEM Image Enhancement
Blurred LiDAR for Sharper 3D: Robust Handheld 3D Scanning with Diffuse LiDAR and RGB
EZSR: Event-based Zero-Shot Recognition
Towards Source-Free Machine Unlearning
Instance-wise Supervision-level Optimization in Active Learning
Conformal Prediction and MLLM aided Uncertainty Quantification in Scene Graph Generation
Focusing on Tracks for Online Multi-Object Tracking
DreamCache: Finetuning-Free Lightweight Personalized Image Generation via Feature Caching
Is 'Right' Right? Enhancing Object Orientation Understanding in Multimodal Large Language Models through Egocentric Instruction Tuning
Efficient Data Driven Mixture-of-Expert Extraction from Trained Networks
RL-RC-DoT: A Block-level RL agent for Task-Aware Video Compression
MAD: Memory-Augmented Detection of 3D Objects
CaMuViD: Calibration-Free Multi-View Detection
Boltzmann Attention Sampling for Image Analysis with Small Objects
Faster Parameter-Efficient Tuning with Token Redundancy Reduction
FALCON: Fairness Learning via Contrastive Attention Approach to Continual Semantic Scene Understanding
HybridGS: Decoupling Transients and Statics with 2D and 3D Gaussian Splatting
Multi-Scale Neighborhood Occupancy Masked Autoencoder for Self-Supervised Learning in LiDAR Point Clouds
MammAlps: A Multi-view Video Behavior Monitoring Dataset of Wild Mammals in the Swiss Alps
MultiVENT 2.0: A Massive Multilingual Benchmark for Event-Centric Video Retrieval
PBR-NeRF: Inverse Rendering with Physics-Based Neural Fields
One2Any: One-Reference 6D Pose Estimation for Any Object
MARBLE: Material Recomposition and Blending in CLIP-Space
MIRE: Matched Implicit Neural Representations
PolarFree: Polarization-based Reflection-Free Imaging
RoboSpatial: Teaching Spatial Understanding to 2D and 3D Vision-Language Models for Robotics
3D-MVP: 3D Multiview Pretraining for Manipulation
Layered Motion Fusion: Lifting Motion Segmentation to 3D in Egocentric Videos
Twinner: Shining Light on Digital Twins in a Few Snaps
VGGT: Visual Geometry Grounded Transformer
DualPM: Dual Posed-Canonical Point Maps for 3D Shape and Pose Reconstruction
3D Convex Splatting: Radiance Field Rendering with 3D Smooth Convexes
PartGen: Part-level 3D Generation and Reconstruction with Multi-view Diffusion Models
UnCommon Objects in 3D
Good, Cheap, and Fast: Overfitted Image Compression with Wasserstein Distortion
MirrorVerse: Pushing Diffusion Models to Realistically Reflect the World
Compass Control: Multi Object Orientation Control for Text-to-Image Generation
Composing Parts for Expressive Object Generation
MonoPlace3D: Learning 3D-Aware Object Placement for 3D Monocular Detection
Retrieving Semantics from the Deep: an RAG Solution for Gesture Synthesis
A Bias-Free Training Paradigm for More General AI-generated Image Detection
Geometry Field Splatting with Gaussian Surfels
Taxonomy-Aware Evaluation of Vision-Language Models
ERUPT: Efficient Rendering with Unposed Patch Transformer
TIDE: Training Locally Interpretable Domain Generalization Models Enables Test-time Correction
Image Reconstruction from Readout-Multiplexed Single-Photon Detector Arrays
PARC: A Quantitative Framework Uncovering the Symmetries within Vision Language Models
Towards Unbiased and Robust Spatio-Temporal Scene Graph Generation and Anticipation
SplatFlow: Self-Supervised Dynamic Gaussian Splatting in Neural Motion Flow Field for Autonomous Driving
Edge-SD-SR: Low Latency and Parameter Efficient On-device Super-Resolution with Stable Diffusion via Bidirectional Conditioning
InterDyn: Controllable Interactive Dynamics with Video Diffusion Models
Disentangling Safe and Unsafe Image Corruptions via Anisotropy and Locality
Concept Lancet: Image Editing with Compositional Representation Transplant
Practical Solutions to the Relative Pose of Three Calibrated Cameras
A Regularization-Guided Equivariant Approach for Image Restoration
From Alexnet to Transformers: Measuring the Non-linearity of Deep Neural Networks with Affine Optimal Transport
A Unified Latent Schrödinger Bridge Diffusion Model for Unsupervised Anomaly Detection and Localization
LookingGlass: Generative Anamorphoses via Laplacian Pyramid Warping
Learning from Streaming Video with Orthogonal Gradients
Precise Event Spotting in Sports Videos: Solving Long-Range Dependency and Class Imbalance
Towards a Universal Synthetic Video Detector: From Face or Background Manipulations to Fully AI-Generated Content
Bias for Action: Video Implicit Neural Representations with Bias Modulation
Chat2SVG: Vector Graphics Generation with Large Language Models and Image Diffusion Models
Camouflage Anything: Learning to Hide using Controlled Out-painting and Representation Engineering
Zero-Shot Novel View and Depth Synthesis with Multi-View Geometric Diffusion
ZeroGrasp: Zero-Shot Shape Reconstruction Enabled Robotic Grasping
Segmenting Maxillofacial Structures in CBCT Volumes
Zero-Shot Styled Text Image Generation, but Make It Autoregressive
MAGiC-SLAM: Multi-Agent Gaussian Globally Consistent SLAM
h-Edit: Effective and Flexible Diffusion-Based Editing via Doob's h-Transform
Curriculum Direct Preference Optimization for Diffusion and Consistency Models
Instant3dit: Multiview Inpainting for Fast Editing of 3D Objects
DiSciPLE: Learning Interpretable Programs for Scientific Visual Discovery
DINOv2 Meets Text: A Unified Framework for Image- and Pixel-Level Vision-Language Alignment
Satellite to GroundScape - Large-scale Consistent Ground View Generation from Satellite Views
STEREO: A Two-Stage Framework for Adversarially Robust Concept Erasing from Text-to-Image Diffusion Models
Towards Zero-Shot Anomaly Detection and Reasoning with Multimodal Large Language Models
GenDeg: Diffusion-based Degradation Synthesis for Generalizable All-In-One Image Restoration
Filter Images First, Generate Instructions Later: Pre-Instruction Data Selection for Visual Instruction Tuning
SINR: Sparsity Driven Compressed Implicit Neural Representations
Distilling Multi-modal Large Language Models for Autonomous Driving
Attention IoU: Examining Biases in CelebA using Attention Maps
COBRA: COmBinatorial Retrieval Augmentation for Few-Shot Adaptation
Black Swan: Abductive and Defeasible Video Reasoning in Unpredictable Events
Low-Rank Adaptation in Multilinear Operator Networks for Security-Preserving Incremental Learning
MarkushGrapher: Joint Visual and Textual Recognition of Markush Structures
Preserving Clusters in Prompt Learning for Unsupervised Domain Adaptation
VISTA: Enhancing Long-Duration and High-Resolution Video Understanding by Video Spatiotemporal Augmentation
ProHOC: Probabilistic Hierarchical Out-of-Distribution Classification via Multi-Depth Networks
VideoGEM: Training-free Action Grounding in Videos
LatentHOI: On the Generalizable Hand Object Motion Generation with Latent Hand Diffusion
Noise-Resistant Video Anomaly Detection via RGB Error-Guided Multiscale Predictive Coding and Dynamic Memory
Investigating the Role of Weight Decay in Enhancing Nonconvex SGD
Exposure-slot: Exposure-centric Representations Learning with Slot-in-Slot Attention for Region-aware Exposure Correction
NoiseCtrl: A Sampling-Algorithm-Agnostic Conditional Generation Method for Diffusion Models
PatchDPO: Patch-level DPO for Finetuning-free Personalized Image Generation
SemAlign3D: Semantic Correspondence between RGB-Images through Aligning 3D Object-Class Representations
FRAME: Floor-aligned Representation for Avatar Motion from Egocentric Video
Uncertainty Meets Diversity: A Comprehensive Active Learning Framework for Indoor 3D Object Detection
Omnidirectional Multi-Object Tracking
Generative Zero-Shot Composed Image Retrieval
Positive2Negative: Breaking the Information-Lossy Barrier in Self-Supervised Single Image Denoising
MV-MATH: Evaluating Multimodal Math Reasoning in Multi-Visual Contexts
MaIR: A Locality- and Continuity-Preserving Mamba for Image Restoration
DeSplat: Decomposed Gaussian Splatting for Distractor-Free Rendering
EDCFlow: Exploring Temporally Dense Difference Maps for Event-based Optical Flow Estimation
Finding Local Diffusion Schrödinger Bridge using Kolmogorov-Arnold Network
Reversible Decoupling Network for Single Image Reflection Removal
VideoDirector: Precise Video Editing via Text-to-Video Models
Deep Fair Multi-View Clustering with Attention KAN
iG-6DoF: Model-free 6DoF Pose Estimation for Unseen Object via Iterative 3D Gaussian Splatting
Reducing Class-wise Confusion for Incremental Learning with Disentangled Manifolds
COSMIC: Clique-Oriented Semantic Multi-space Integration for Robust CLIP Test-Time Adaptation
Dual Prompting Image Restoration with Diffusion Transformers
Shining Yourself: High-Fidelity Ornaments Virtual Try-on with Diffusion Model
Noise Diffusion for Enhancing Semantic Faithfulness in Text-to-Image Synthesis
STING-BEE: Towards Vision-Language Model for Real-World X-ray Baggage Security Inspection
ReWind: Understanding Long Videos with Instructed Learnable Memory
Zero-shot 3D Question Answering via Voxel-based Dynamic Token Compression
Silent Branding Attack: Trigger-free Data Poisoning Attack on Text-to-Image Diffusion Models
UVGS: Reimagining Unstructured 3D Gaussian Splatting using UV Mapping
IceDiff: High Resolution and High-Quality Arctic Sea Ice Forecasting with Generative Diffusion Prior
Where's the Liability in the Generative Era? Recovery-based Black-Box Detection of AI-Generated Content
VisionArena: 230k Real World User-VLM Conversations with Preference Labels
Enhancing Video-LLM Reasoning via Agent-of-Thoughts Distillation
BFANet: Revisiting 3D Semantic Segmentation with Boundary Feature Analysis
GCE-Pose: Global Context Enhancement for Category-level Object Pose Estimation
Z-Magic: Zero-shot Multiple Attributes Guided Image Creator
DoraCycle: Domain-Oriented Adaptation of Unified Generative Model in Multimodal Cycles
ChatGen: Automatic Text-to-Image Generation From FreeStyle Chatting
PhysVLM: Enabling Visual Language Models to Understand Robotic Physical Reachability
Weakly Supervised Semantic Segmentation via Progressive Confidence Region Expansion
F-LMM: Grounding Frozen Large Multimodal Models
Joint Optimization of Neural Radiance Fields and Continuous Camera Motion from a Monocular Video
AnyCam: Learning to Recover Camera Poses and Intrinsics from Casual Videos
Image Generation Diversity Issues and How to Tame Them
Detail-Preserving Latent Diffusion for Stable Shadow Removal
FLAME: Frozen Large Language Models Enable Data-Efficient Language-Image Pre-training
LayoutVLM: Differentiable Optimization of 3D Layout via Vision-Language Models
LLaVA-ST: A Multimodal Large Language Model for Fine-Grained Spatial-Temporal Understanding
OpenHumanVid: A Large-Scale High-Quality Dataset for Enhancing Human-Centric Video Generation
AMO Sampler: Enhancing Text Rendering with Overshooting
FoundationStereo: Zero-Shot Stereo Matching
MonoTAKD: Teaching Assistant Knowledge Distillation for Monocular 3D Object Detection
JarvisIR: Elevating Autonomous Driving Perception with Intelligent Image Restoration
EMOE: Modality-Specific Enhanced Dynamic Emotion Experts
Geometric Knowledge-Guided Localized Global Distribution Alignment for Federated Learning
SEAL: Semantic Attention Learning for Long Video Representation
FaceBench: A Multi-View Multi-Level Facial Attribute VQA Dataset for Benchmarking Face Perception MLLMs
ODE: Open-Set Evaluation of Hallucinations in Multimodal Large Language Models
Keep the Balance: A Parameter-Efficient Symmetrical Framework for RGB+X Semantic Segmentation
Distilling Monocular Foundation Model for Fine-grained Depth Completion
RainyGS: Efficient Rain Synthesis with Physically-Based Gaussian Splatting
JiSAM: Alleviate Labeling Burden and Corner Case Problems in Autonomous Driving via Minimal Real-World Data
Light Transport-aware Diffusion Posterior Sampling for Single-View Reconstruction of 3D Volumes
Beyond Background Shift: Rethinking Instance Replay in Continual Semantic Segmentation
XLRS-Bench: Could Your Multimodal LLMs Understand Extremely Large Ultra-High-Resolution Remote Sensing Imagery?
Efficient Personalization of Quantized Diffusion Model without Backpropagation
SUM Parts: Benchmarking Part-Level Semantic Segmentation of Urban Meshes
Lost in Translation, Found in Context: Sign Language Translation with Contextual Cues
Nearly Zero-Cost Protection Against Mimicry by Personalized Diffusion Models
BIMBA: Selective-Scan Compression for Long-Range Video Question Answering
PTDiffusion: Free Lunch for Generating Optical Illusion Hidden Pictures with Phase-Transferred Diffusion Model
4Real-Video: Learning Generalizable Photo-Realistic 4D Video Diffusion
Temporal Alignment-Free Video Matching for Few-shot Action Recognition
Foveated Instance Segmentation
Zero-Shot Head Swapping in Real-World Scenarios
VERA: Explainable Video Anomaly Detection via Verbalized Learning of Vision-Language Models
SplatFlow: Multi-View Rectified Flow Model for 3D Gaussian Splatting Synthesis
Satellite Observations Guided Diffusion Model for Accurate Meteorological States at Arbitrary Resolution
Pixel-aligned RGB-NIR Stereo Imaging and Dataset for Robot Vision
Dual Exposure Stereo for Extended Dynamic Range 3D Imaging
DRAWER: Digital Reconstruction and Articulation With Environment Realism
BlobGEN-Vid: Compositional Text-to-Video Generation with Blob Video Representations
Adapting to the Unknown: Training-Free Audio-Visual Event Perception with Dynamic Thresholds
Learned Binocular-Encoding Optics for RGBD Imaging Using Joint Stereo and Focus Cues
Improving Sound Source Localization with Joint Slot Attention on Image and Audio
Pow3R: Empowering Unconstrained 3D Reconstruction with Camera and Scene Priors
Derivative-Free Diffusion Manifold-Constrained Gradient for Unified XAI
ABBSPO: Adaptive Bounding Box Scaling and Symmetric Prior based Orientation Prediction for Detecting Aerial Image Objects
Towards Lossless Implicit Neural Representation via Bit Plane Decomposition
Apply Hierarchical-Chain-of-Generation to Complex Attributes Text-to-3D Generation
DefectFill: Realistic Defect Generation with Inpainting Diffusion Model for Visual Inspection
Multi-Modal Aerial-Ground Cross-View Place Recognition with Neural ODEs
MoDec-GS: Global-to-Local Motion Decomposition and Temporal Interval Adjustment for Compact Dynamic 3D Gaussian Splatting
HOT: Hadamard-based Optimized Training
Sampling Innovation-Based Adaptive Compressive Sensing
Scalable Autoregressive Monocular Depth Estimation
DiskVPS: Vanishing Point Detector via Hough Transform in a Disk Region
Minimal Interaction Separated Tuning: A New Paradigm for Visual Adaptation
Align-A-Video: Deterministic Reward Tuning of Image Diffusion Models for Consistent Video Editing
Three-view Focal Length Recovery From Homographies
FreeScene: Mixed Graph Diffusion for 3D Scene Synthesis from Free Prompts
DriveScape: High-Resolution Driving Video Generation by Multi-View Feature Fusion
DistinctAD: Distinctive Audio Description Generation in Contexts
Towards Autonomous Micromobility through Scalable Urban Simulation
Synthetic Data is an Elegant GIFT for Continual Vision-Language Models
Event-Equalized Dense Video Captioning
MaskGWM: A Generalizable Driving World Model with Video Mask Reconstruction
dFLMoE: Decentralized Federated Learning via Mixture of Experts for Medical Data Analysis
Seq2Time: Sequential Knowledge Transfer for Video LLM Temporal Grounding
3D-AVS: LiDAR-based 3D Auto-Vocabulary Segmentation
LeviTor: 3D Trajectory Oriented Image-to-Video Synthesis
RobSense: A Robust Multi-modal Foundation Model for Remote Sensing with Static, Temporal, and Incomplete Data Adaptability
SeCap: Self-Calibrating and Adaptive Prompts for Cross-view Person Re-Identification in Aerial-Ground Networks
Learning Conditional Space-Time Prompt Distributions for Video Class-Incremental Learning
Dual-Agent Optimization framework for Cross-Domain Few-Shot Segmentation
HSI-GPT: A General-Purpose Large Scene-Motion-Language Model for Human Scene Interaction
DFM: Differentiable Feature Matching for Anomaly Detection
Towards Precise Embodied Dialogue Localization via Causality Guided Diffusion
BiM-VFI: Bidirectional Motion Field-Guided Frame Interpolation for Video with Non-uniform Motions
BimArt: A Unified Approach for the Synthesis of 3D Bimanual Interaction with Articulated Objects
Chain of Semantics Programming in 3D Gaussian Splatting Representation for 3D Vision Grounding
Generating Multimodal Driving Scenes via Next-Scene Prediction
HoVLE: Unleashing the Power of Monolithic Vision-Language Models with Holistic Vision-Language Embedding
Link-based Contrastive Learning for One-Shot Unsupervised Domain Adaptation
MEGA: Masked Generative Autoencoder for Human Mesh Recovery
Scene Map-based Prompt Tuning for Navigation Instruction Generation
EIDT-V: Exploiting Intersections in Diffusion Trajectories for Model-Agnostic, Zero-Shot, Training-Free Text-to-Video Generation
MMRL: Multi-Modal Representation Learning for Vision-Language Models
Label Shift Meets Online Learning: Ensuring Consistent Adaptation with Universal Dynamic Regret
ViCaS: A Dataset for Combining Holistic and Pixel-level Video Understanding using Captions with Grounded Segmentation
Anyattack: Towards Large-scale Self-supervised Adversarial Attacks on Vision-language Models
PVC: Progressive Visual Token Compression for Unified Image and Video Processing in Large Vision-Language Models
Reconstruction vs. Generation: Taming Optimization Dilemma in Latent Diffusion Models
Mask-Adapter: The Devil is in the Masks for Open-Vocabulary Segmentation
Rethinking Noisy Video-Text Retrieval via Relation-aware Alignment
SAM2-LOVE: Segment Anything Model 2 in Language-aided Audio-Visual Scenes
Shape and Texture: What Influences Reliable Optical Flow Estimation?
MC^2: Multi-concept Guidance for Customized Multi-concept Generation
KAC: Kolmogorov-Arnold Classifier for Continual Learning
Spherical Manifold Guided Diffusion Model for Panoramic Image Generation
A Unified Image-Dense Annotation Generation Model for Underwater Scenes
SILMM: Self-Improving Large Multimodal Models for Compositional Text-to-Image Generation
Unified Dense Prediction of Video Diffusion
Learning 4D Panoptic Scene Graph Generation from Rich 2D Visual Scene
Precise, Fast, and Low-cost Concept Erasure in Value Space: Orthogonal Complement Matters
Generative Densification: Learning to Densify Gaussians for High-Fidelity Generalizable 3D Reconstruction
VTON-HandFit: Virtual Try-on for Arbitrary Hand Pose Guided by Hand Priors Embedding
CamFreeDiff: Camera-free Image to Panorama Generation with Diffusion Model
MobileMamba: Lightweight Multi-Receptive Visual Mamba Network
SDBF: Steep-Decision-Boundary Fingerprinting for Hard-Label Tampering Detection of DNN Models
M-LLM Based Video Frame Selection for Efficient Video Understanding
DocSAM: Unified Document Image Segmentation via Query Decomposition and Heterogeneous Mixed Learning
Empowering Vector Graphics with Consistently Arbitrary Viewing and View-dependent Visibility
Activating Sparse Part Concepts for 3D Class Incremental Learning
RoboBrain: A Unified Brain Model for Robotic Manipulation from Abstract to Concrete
Uncertainty-Instructed Structure Injection for Generalizable HD Map Construction
StoryGPT-V: Large Language Models as Consistent Story Visualizers
Robotic Visual Instruction
From Head to Tail: Efficient Black-box Model Inversion Attack via Long-tailed Learning
From Head to Tail: Towards Balanced Representation in Large Vision-Language Models through Adaptive Data Calibration
Learning to Detect Objects from Multi-Agent LiDAR Scans without Manual Labels
Image Over Text: Transforming Formula Recognition Evaluation with Character Detection Matching
Protecting Your Video Content: Disrupting Automated Video-based LLM Annotations
FAM Diffusion: Frequency and Attention Modulation for High-Resolution Image Generation with Stable Diffusion
Improving Gaussian Splatting with Localized Points Management
DeCafNet: Delegate and Conquer for Efficient Temporal Grounding in Long Videos
RayFlow: Instance-Aware Diffusion Acceleration via Adaptive Flow Trajectories
Modeling Multiple Normal Action Representations for Error Detection in Procedural Tasks
Rotation-Equivariant Self-Supervised Method in Image Denoising
MG-MotionLLM: A Unified Framework for Motion Comprehension and Generation across Multiple Granularities
Thinking in Space: How Multimodal Large Language Models See, Remember, and Recall Spaces
Mind the Trojan Horse: Image Prompt Adapter Enabling Scalable and Deceptive Jailbreaking
Spatiotemporal Decoupling for Efficient Vision-Based Occupancy Forecasting
TopoCellGen: Generating Histopathology Cell Topology with a Diffusion Model
MOS-Attack: A Scalable Multi-objective Adversarial Attack Framework
MambaOut: Do We Really Need Mamba for Vision?
Diffusion Model is Effectively Its Own Teacher
Scaling Down Text Encoders of Text-to-Image Diffusion Models
AVQACL: A Novel Benchmark for Audio-Visual Question Answering Continual Learning
Every SAM Drop Counts: Embracing Semantic Priors for Multi-Modality Image Fusion and Beyond
Align-KD: Distilling Cross-Modal Alignment Knowledge for Mobile Vision-Language Large Model Enhancement
Exploring Scene Affinity for Semi-Supervised LiDAR Semantic Segmentation
HOIGPT: Learning Long-Sequence Hand-Object Interaction with Language Models
Science-T2I: Addressing Scientific Illusions in Image Synthesis
EAP-GS: Efficient Augmentation of Pointcloud for 3D Gaussian Splatting in Few-shot Scene Reconstruction
AlignMamba: Enhancing Multimodal Mamba with Local and Global Cross-modal Alignment
V2X-R: Cooperative LiDAR-4D Radar Fusion with Denoising Diffusion for 3D Object Detection
CATANet: Efficient Content-Aware Token Aggregation for Lightweight Image Super-Resolution
AutoLUT: LUT-Based Image Super-Resolution with Automatic Sampling and Adaptive Residual Learning
PersonaHOI: Effortlessly Improving Face Personalization in Human-Object Interaction Generation
TAPT: Test-Time Adversarial Prompt Tuning for Robust Inference in Vision-Language Models
Imputation-free and Alignment-free: Incomplete Multi-view Clustering Driven by Consensus Semantic Learning
EASEMVC: Efficient Dual Selection Mechanism for Deep Multi-View Clustering
DyMO: Training-Free Diffusion Model Alignment with Dynamic Multi-Objective Scheduling
BEVDiffuser: Plug-and-Play Diffusion Model for BEV Denoising with Ground-Truth Guidance
DNF: Unconditional 4D Generation with Dictionary-based Neural Fields
Blind Bitstream-corrupted Video Recovery via Metadata-guided Diffusion Model
M3GYM: A Large-Scale Multimodal Multi-view Multi-person Pose Dataset for Fitness Activity Understanding in Real-world Settings
AKiRa: Augmentation Kit on Rays for Optical Video Generation
Reason-before-Retrieve: One-Stage Reflective Chain-of-Thoughts for Training-Free Zero-Shot Composed Image Retrieval
Exploration-Driven Generative Interactive Environments
StyleSSP: Sampling StartPoint Enhancement for Training-free Diffusion-based Method for Style Transfer
SoftShadow: Leveraging Soft Masks for Penumbra-Aware Shadow Removal
PointSR: Self-Regularized Point Supervision for Drone-View Object Detection
Towards RAW Object Detection in Diverse Conditions
Deformable Radial Kernel Splatting
SCAP: Transductive Test-Time Adaptation via Supportive Clique-based Attribute Prompting
WonderWorld: Interactive 3D Scene Generation from a Single Image
POMP: Physics-constrainable Motion Generative Model through Phase Manifolds
Hearing Anywhere in Any Environment
ICP: Immediate Compensation Pruning for Mid-to-high Sparsity
PlanarSplatting: Accurate Planar Surface Reconstruction in 3 Minutes
Parameterized Blur Kernel Prior Learning for Local Motion Deblurring
UniNet: A Contrastive Learning-guided Unified Framework with Feature Selection for Anomaly Detection
Driving by the Rules: A Benchmark for Integrating Traffic Sign Regulations into Vectorized HD Map
Modeling Thousands of Human Annotators for Generalizable Text-to-Image Person Re-identification
Optimizing for the Shortest Path in Denoising Diffusion Model
AeroGen: Enhancing Remote Sensing Object Detection with Diffusion-Driven Data Generation
Learning Partonomic 3D Reconstruction from Image Collections
MDP: Multidimensional Vision Model Pruning with Latency Constraint
ACL: Activating Capability of Linear Attention for Image Restoration
MOS: Modeling Object-Scene Associations in Generalized Category Discovery
MoGe: Unlocking Accurate Monocular Geometry Estimation for Open-Domain Images with Optimal Training Supervision
Blurry-Edges: Photon-Limited Depth Estimation from Defocused Boundaries
Luminance-GS: Adapting 3D Gaussian Splatting to Challenging Lighting Conditions with View-Adaptive Curve Adjustment
IDEA-Bench: How Far are Generative Models from Professional Designing?
Transformers without Normalization
Enhancing Dataset Distillation via Non-Critical Region Refinement
Few-shot Implicit Function Generation via Equivariance
POPEN: Preference-Based Optimization and Ensemble for LVLM-Based Reasoning Segmentation
PMNI: Pose-free Multi-view Normal Integration for Reflective and Textureless Surface Reconstruction
Segment Any Motion in Videos
Bridging Viewpoint Gaps: Geometric Reasoning Boosts Semantic Correspondence
GenFusion: Closing the Loop between Reconstruction and Generation via Videos
Lux Post Facto: Learning Portrait Performance Relighting with Conditional Video Diffusion and a Hybrid Dataset
MIMO: A Medical Vision Language Model with Visual Referring Multimodal Input and Pixel Grounding Multimodal Output
StyleStudio: Text-Driven Style Transfer with Selective Control of Style Elements
GS-2DGS: Geometrically Supervised 2DGS for Reflective Object Reconstruction
CAD-Llama: Leveraging Large Language Models for Computer-Aided Design Parametric 3D Model Generation
From Slow Bidirectional to Fast Autoregressive Video Diffusion Models
R-SCoRe: Revisiting Scene Coordinate Regression for Robust Large-Scale Visual Localization
PointLoRA: Low-Rank Adaptation with Token Selection for Point Cloud Learning
Shift the Lens: Environment-Aware Unsupervised Camouflaged Object Detection
Logits DeConfusion with CLIP for Few-Shot Learning
Bridging Gait Recognition and Large Language Models Sequence Modeling
Unbiased Video Scene Graph Generation via Visual and Semantic Dual Debiasing
DiffCAM: Data-Driven Saliency Maps by Capturing Feature Differences
G3Flow: Generative 3D Semantic Flow for Pose-aware and Generalizable Object Manipulation
HumanRig: Learning Automatic Rigging for Humanoid Character in a Large Scale Dataset
PHGC: Procedural Heterogeneous Graph Completion for Natural Language Task Verification in Egocentric Videos
When the Future Becomes the Past: Taming Temporal Correspondence for Self-supervised Video Representation Learning
DEFOM-Stereo: Depth Foundation Model Based Stereo Matching
Language-Guided Audio-Visual Learning for Long-Term Sports Assessment
VidTwin: Video VAE with Decoupled Structure and Dynamics
3D-GRAND: A Million-Scale Dataset for 3D-LLMs with Better Grounding and Less Hallucination
EventFly: Event Camera Perception from Ground to the Sky
Semantic and Sequential Alignment for Referring Video Object Segmentation
Minding Fuzzy Regions: A Data-driven Alternating Learning Paradigm for Stable Lesion Segmentation
Adversarial Domain Prompt Tuning and Generation for Single Domain Generalization
AI-Face: A Million-Scale Demographically Annotated AI-Generated Face Dataset and Fairness Benchmark
3D Gaussian Head Avatars with Expressive Dynamic Appearances by Compact Tensorial Representations
Phoenix: A Motion-based Self-Reflection Framework for Fine-grained Robotic Action Correction
Glossy Object Reconstruction with Cost-effective Polarized Acquisition
Prompting Depth Anything for 4K Resolution Accurate Metric Depth Estimation
SACB-Net: Spatial-awareness Convolutions for Medical Image Registration
LumiNet: Latent Intrinsics Meets Diffusion Models for Indoor Scene Relighting
Towards Visual Discrimination and Reasoning of Real-World Physical Dynamics: Physics-Grounded Anomaly Detection
CraftsMan3D: High-fidelity Mesh Generation with 3D Native Diffusion and Interactive Geometry Refiner
Generative Sparse-View Gaussian Splatting
ProjAttacker: A Configurable Physical Adversarial Attack for Face Recognition via Projector
ComfyBench: Benchmarking LLM-based Agents in ComfyUI for Autonomously Designing Collaborative AI Systems
Doppelgangers++: Improved Visual Disambiguation with Geometric 3D Features
ArticulatedGS: Self-supervised Digital Twin Modeling of Articulated Objects using 3D Gaussian Splatting
Hash3D: Training-free Acceleration for 3D Generation
Progressive Focused Transformer for Single Image Super-Resolution
Learned Image Compression with Dictionary-based Entropy Model
Reasoning to Attend: Try to Understand How <SEG> Token Works
DRiVE: Diffusion-based Rigging Empowers Generation of Versatile and Expressive Characters
Open-Vocabulary Functional 3D Scene Graphs for Real-World Indoor Spaces
GIVEPose: Gradual Intra-class Variation Elimination for RGB-based Category-Level Object Pose Estimation
Towards Understanding How Knowledge Evolves in Large Vision-Language Models
PatchVSR: Breaking Video Diffusion Resolution Limits with Patch-wise Video Super-Resolution
Dynamic Group Normalization: Spatio-Temporal Adaptation to Evolving Data Statistics
SemGeoMo: Dynamic Contextual Human Motion Generation with Semantic and Geometric Guidance
RAP: Retrieval-Augmented Personalization for Multimodal Large Language Models
UniSTD: Towards Unified Spatio-Temporal Learning across Diverse Disciplines
Mamba as a Bridge: Where Vision Foundation Models Meet Vision Language Models for Domain-Generalized Semantic Segmentation
Flow-NeRF: Joint Learning of Geometry, Poses, and Dense Flow within Unified Neural Representations
HIIF: Hierarchical Encoding based Implicit Image Function for Continuous Super-resolution
AMR-Transformer: Enabling Efficient Long-range Interaction for Complex Neural Fluid Simulation
AToM: Aligning Text-to-Motion Model at Event-Level with GPT-4Vision Reward
ReCapture: Generative Video Camera Controls for User-Provided Videos using Masked Video Fine-Tuning
SketchAgent: Language-Driven Sequential Sketch Generation
DocVLM: Make Your VLM an Efficient Reader
Img-Diff: Contrastive Data Synthesis for Multimodal Large Language Models
DynPose: Largely Improving the Efficiency of Human Pose Estimation by a Simple Dynamic Framework
Type-R: Automatically Retouching Typos for Text-to-Image Generation
HoGS: Unified Near and Far Object Reconstruction via Homogeneous Gaussian Splatting
CoSDH: Communication-Efficient Collaborative Perception via Supply-Demand Awareness and Intermediate-Late Hybridization
MNE-SLAM: Multi-Agent Neural SLAM for Mobile Robots
HuMoCon: Concept Discovery for Human Motion Understanding
Reloc3r: Large-Scale Training of Relative Camera Pose Regression for Generalizable, Fast, and Accurate Visual Localization
SeaLion: Semantic Part-Aware Latent Point Diffusion Models for 3D Generation
Attribute-formed Class-specific Concept Space: Endowing Language Bottleneck Model with Better Interpretability and Scalability
Reconstructing Close Human Interaction with Appearance and Proxemics Reasoning
Hierarchical Flow Diffusion for Efficient Frame Interpolation
Multi-party Collaborative Attention Control for Image Customization
ImagineFSL: Self-Supervised Pretraining Matters on Imagined Base Set for VLM-based Few-shot Learning
Boosting the Dual-Stream Architecture in Ultra-High Resolution Segmentation with Resolution-Biased Uncertainty Estimation
PS-Diffusion: Photorealistic Subject-Driven Image Editing with Disentangled Control and Attention
No Pains, More Gains: Recycling Sub-Salient Patches for Efficient High-Resolution Image Recognition
TraF-Align: Trajectory-aware Feature Alignment for Asynchronous Multi-agent Perception
ROS-SAM: High-Quality Interactive Segmentation for Remote Sensing Moving Object
Libra-Merging: Importance-redundancy and Pruning-merging Trade-off for Acceleration Plug-in in Large Vision-Language Model
SEC-Prompt: SEmantic Complementary Prompting for Few-Shot Class-Incremental Learning
Sonata: Self-Supervised Learning of Reliable Point Representations
IDProtector: An Adversarial Noise Encoder to Protect Against ID-Preserving Image Generation
Separation of Powers: On Segregating Knowledge from Observation in LLM-enabled Knowledge-based Visual Question Answering
LoRASculpt: Sculpting LoRA for Harmonizing General and Specialized Knowledge in Multimodal Large Language Models
Person De-reidentification: A Variation-guided Identity Shift Modeling
MANTA: A Large-Scale Multi-View and Visual-Text Anomaly Detection Dataset for Tiny Objects
DyCoke: Dynamic Compression of Tokens for Fast Video Large Language Models
RORem: Training a Robust Object Remover with Human-in-the-Loop
Leveraging Perturbation Robustness to Enhance Out-of-Distribution Detection
MetaWriter: Personalized Handwritten Text Recognition Using Meta-Learned Prompt Tuning
LineArt: A Knowledge-guided Training-free High-quality Appearance Transfer for Design Drawing with Diffusion Model
Progress-Aware Video Frame Captioning
SeeGround: See and Ground for Zero-Shot Open-Vocabulary 3D Visual Grounding
Multi-Label Prototype Visual Spatial Search for Weakly Supervised Semantic Segmentation
Multi-modal Contrastive Learning with Negative Sampling Calibration for Phenotypic Drug Discovery
Imagine and Seek: Improving Composed Image Retrieval with an Imagined Proxy
DiffVsgg: Diffusion-Driven Online Video Scene Graph Generation
Adapting Text-to-Image Generation with Feature Difference Instruction for Generic Image Restoration
Multimodal Autoregressive Pre-training of Large Vision Encoders
Efficient Diffusion as Low Light Enhancer
ADD: Attribution-Driven Data Augmentation Framework for Boosting Image Super-Resolution
On the Out-Of-Distribution Generalization of Large Multimodal Models
MobileH2R: Learning Generalizable Human to Mobile Robot Handover Exclusively from Scalable and Diverse Synthetic Data
Mono3DVLT: Monocular-Video-Based 3D Visual Language Tracking
Bridging the Gap between Gaussian Diffusion Models and Universal Quantization for Image Compression
MaRI: Material Retrieval Integration across Domains
Gain from Neighbors: Boosting Model Robustness in the Wild via Adversarial Perturbations Toward Neighboring Classes
MotionBench: Benchmarking and Improving Fine-grained Video Motion Understanding for Vision Language Models
EnliveningGS: Active Locomotion of 3DGS
ARM: Appearance Reconstruction Model for Relightable 3D Generation
Image Referenced Sketch Colorization Based on Animation Creation Workflow
The Change You Want To Detect: Semantic Change Detection In Earth Observation With Hybrid Data Generation
SphereUFormer: A U-Shaped Transformer for Spherical 360 Perception
SET: Spectral Enhancement for Tiny Object Detection
LLMDet: Learning Strong Open-Vocabulary Object Detectors under the Supervision of Large Language Models
Towards Consistent Multi-Task Learning: Unlocking the Potential of Task-Specific Parameters
Revisiting Fairness in Multitask Learning: A Performance-Driven Approach for Variance Reduction
RoboSense: Large-scale Dataset and Benchmark for Egocentric Robot Perception and Navigation in Crowded and Unstructured Environments
UniMamba: Unified Spatial-Channel Representation Learning with Group-Efficient Mamba for LiDAR-based 3D Object Detection
High-Fidelity Relightable Monocular Portrait Animation with Lighting-Controllable Video Diffusion Model
Bridge Frame and Event: Common Spatiotemporal Fusion for High-Dynamic Scene Optical Flow
Detection-Friendly Nonuniformity Correction: A Union Framework for Infrared UAV Target Detection
TimeTracker: Event-based Continuous Point Tracking for Video Frame Interpolation with Non-linear Motion
RoboPEPP: Vision-Based Robot Pose and Joint Angle Estimation through Embedding Predictive Pre-Training
Rate-In: Information-Driven Adaptive Dropout Rates for Improved Inference-Time Uncertainty Estimation
AA-CLIP: Enhancing Zero-Shot Anomaly Detection via Anomaly-Aware CLIP
Deep Change Monitoring: A Hyperbolic Representative Learning Framework and a Dataset for Long-term Fine-grained Tree Change Detection
Medusa: A Multi-Scale High-order Contrastive Dual-Diffusion Approach for Multi-View Clustering
Concept Replacer: Replacing Sensitive Concepts in Diffusion Models via Precision Localization
HOP: Heterogeneous Topology-based Multimodal Entanglement for Co-Speech Gesture Generation
MVGenMaster: Scaling Multi-View Generation from Any Image via 3D Priors Enhanced Diffusion Model
CustAny: Customizing Anything from A Single Example
Synthetic-to-Real Self-supervised Robust Depth Estimation via Learning with Motion and Structure Priors
SnapGen: Taming High-Resolution Text-to-Image Models for Mobile Devices with Efficient Architectures and Training
Incomplete Multi-View Multi-label Learning via Disentangled Representation and Label Semantic Embedding
RaCFormer: Towards High-Quality 3D Object Detection via Query-based Radar-Camera Fusion
OccMamba: Semantic Occupancy Prediction with State Space Models
MIMO: Controllable Character Video Synthesis with Spatial Decomposed Modeling
Multi-view Reconstruction via SfM-guided Monocular Depth Estimation
Learnable Infinite Taylor Gaussian for Dynamic View Rendering
Consistency Posterior Sampling for Diverse Image Synthesis
ChatHuman: Chatting about 3D Humans with Tools
Language Guided Concept Bottleneck Models for Interpretable Continual Learning
Ouroboros3D: Image-to-3D Generation via 3D-aware Recursive Diffusion
Forensics-Bench: A Comprehensive Forgery Detection Benchmark Suite for Large Vision Language Models
Hierarchical Gaussian Mixture Model Splatting for Efficient and Part Controllable 3D Generation
Feature Information Driven Position Gaussian Distribution Estimation for Tiny Object Detection
Open-World Objectness Modeling Unifies Novel Object Detection
NN-Former: Rethinking Graph Structure in Neural Architecture Representation
Video Language Model Pretraining with Spatio-temporal Masking
Building Vision Models upon Heat Conduction
AutoSSVH: Exploring Automated Frame Sampling for Efficient Self-Supervised Video Hashing
Unity in Diversity: Video Editing via Gradient-Latent Purification
VideoSPatS: Video SPatiotemporal Splines for Disentangled Occlusion, Appearance and Motion Modeling and Editing
Motion-Grounded Video Reasoning: Understanding and Perceiving Motion at Pixel Level
FRESA: Feedforward Reconstruction of Personalized Skinned Avatars from Few Images
GEM: A Generalizable Ego-Vision Multimodal World Model for Fine-Grained Ego-Motion, Object Dynamics, and Scene Composition Control
Prompt2Perturb (P2P): Text-Guided Diffusion-Based Adversarial Attack on Breast Ultrasound Images
NoT: Federated Unlearning via Weight Negation
ProKeR: A Kernel Perspective on Few-Shot Adaptation of Large Vision-Language Models
Gromov–Wasserstein Problem with Cyclic Symmetry
MeshArt: Generating Articulated Meshes with Structure-Guided Transformers
TSD-SR: One-Step Diffusion with Target Score Distillation for Real-World Image Super-Resolution
Towards Universal Soccer Video Understanding
One-Way Ticket: Time-Independent Unified Encoder for Distilling Text-to-Image Diffusion Models
Multi-modal Medical Diagnosis via Large-small Model Collaboration
Chat-based Person Retrieval via Dialogue-Refined Cross-Modal Alignment
Fish-Vista: A Multi-Purpose Dataset for Understanding & Identification of Traits from Images
A Comprehensive Study of Decoder-Only LLMs for Text-to-Image Generation
Exploring Historical Information for RGBE Visual Tracking with Mamba
Active Hyperspectral Imaging Using an Event Camera
EventPSR: Surface Normal and Reflectance Estimation from Photometric Stereo Using an Event Camera
Do We Really Need Curated Malicious Data for Safety Alignment in Multi-modal Large Language Models?
Fast and Accurate Gigapixel Pathological Image Classification with Hierarchical Distillation Multi-Instance Learning
DecoupledGaussian: Object-Scene Decoupling for Physics-Based Interaction
Do Computer Vision Foundation Models Learn the Low-level Characteristics of the Human Visual System?
Sound Bridge: Associating Egocentric and Exocentric Videos via Audio Cues
Trajectory Mamba: Efficient Attention-Mamba Forecasting Model Based on Selective SSM
Resilient Sensor Fusion Under Adverse Sensor Failures via Multi-Modal Expert Fusion
DiC: Rethinking Conv3x3 Designs in Diffusion Models
SAM-REF: Introducing Image-Prompt Synergy during Interaction for Detail Enhancement in the Segment Anything Model
LinGen: Towards High-Resolution Minute-Length Text-to-Video Generation with Linear Computational Complexity
InteractionMap: Improving Online Vectorized HDMap Construction with Interaction
Can't Slow Me Down: Learning Robust and Hardware-Adaptive Object Detectors against Latency Attacks for Edge Devices
Scaling Inference Time Compute for Diffusion Models
3D Gaussian Inpainting with Depth-Guided Cross-View Consistency
UWAV: Uncertainty-weighted Weakly-supervised Audio-Visual Video Parsing
Dr. Splat: Directly Referring 3D Gaussian Splatting via Direct Language Embedding Registration
Sparse Voxels Rasterization: Real-time High-fidelity Radiance Field Rendering
VideoMage: Multi-Subject and Motion Customization of Text-to-Video Diffusion Models
Learning Endogenous Attention for Incremental Object Detection
Omni-RGPT: Unifying Image and Video Region-level Understanding via Token Marks
Point-Cache: Test-time Dynamic and Hierarchical Cache for Robust and Generalizable Point Cloud Analysis
FluxSpace: Disentangled Semantic Editing in Rectified Flow Models
SKE-Layout: Spatial Knowledge Enhanced Layout Generation with LLMs
Think Small, Act Big: Primitive Prompt Learning for Lifelong Robot Manipulation
A Hubness Perspective on Representation Learning for Graph-Based Multi-View Clustering
Advancing Generalizable Tumor Segmentation with Anomaly-Aware Open-Vocabulary Attention Maps and Frozen Foundation Diffusion Models
Patch Matters: Training-free Fine-grained Image Caption Enhancement via Local Perception
TAROT: Towards Essentially Domain-Invariant Robustness with Theoretical Justification
S2Gaussian: Sparse-View Super-Resolution 3D Gaussian Splatting
Towards Million-Scale Adversarial Robustness Evaluation With Stronger Individual Attacks
Scene4U: Hierarchical Layered 3D Scene Reconstruction from Single Panoramic Image for Your Immerse Exploration
Dynamic Derivation and Elimination: Audio Visual Segmentation with Enhanced Audio Semantics
Robust Audio-Visual Segmentation via Audio-Guided Visual Convergent Alignment
Data-free Universal Adversarial Perturbation with Pseudo-semantic Prior
OmniStyle: Filtering High Quality Style Transfer Data at Scale
OpticalNet: An Optical Imaging Dataset and Benchmark Beyond the Diffraction Limit
SimAvatar: Simulation-Ready Avatars with Layered Hair and Clothing
BLADE: Single-view Body Mesh Estimation through Accurate Depth Estimation
Towards Improved Text-Aligned Codebook Learning: Multi-Hierarchical Codebook-Text Alignment with Long Text
AlphaPre: Amplitude-Phase Disentanglement Model for Precipitation Nowcasting
Sensitivity-Aware Efficient Fine-Tuning via Compact Dynamic-Rank Adaptation
StyleMaster: Stylize Your Video with Artistic Generation and Translation
GRAPHGPT-O: Synergistic Multimodal Comprehension and Generation on Graphs
Overcoming Shortcut Problem in VLM for Robust Out-of-Distribution Detection
Divide and Conquer: Heterogeneous Noise Integration for Diffusion-based Adversarial Purification
MonoDGP: Monocular 3D Object Detection with Decoupled-Query and Geometry-Error Priors
PCDreamer: Point Cloud Completion Through Multi-view Diffusion Priors
Autoregressive Sequential Pretraining for Visual Tracking
Boosting Domain Incremental Learning: Selecting the Optimal Parameters is All You Need
One-Step Event-Driven High-Speed Autofocus
VISCO: Benchmarking Fine-Grained Critique and Correction Towards Self-Improvement in Visual Reasoning
From Prototypes to General Distributions: An Efficient Curriculum for Masked Image Modeling
DART: Disease-aware Image-Text Alignment and Self-correcting Re-alignment for Trustworthy Radiology Report Generation
VLMs-Guided Representation Distillation for Efficient Vision-Based Reinforcement Learning
Is this Generated Person Existed in Real-world? Fine-grained Detecting and Calibrating Abnormal Human-body
DreamRelation: Bridging Customization and Relation Generation
SPMTrack: Spatio-Temporal Parameter-Efficient Fine-Tuning with Mixture of Experts for Scalable Visual Tracking
FIMA-Q: Post-Training Quantization for Vision Transformers by Fisher Information Matrix Approximation
APHQ-ViT: Post-Training Quantization with Average Perturbation Hessian Based Reconstruction for Vision Transformers
3D-SLNR: A Super Lightweight Neural Representation for Large-scale 3D Mapping
Repurposing Pre-trained Video Diffusion Models for Event-based Video Interpolation
Foley-Flow: Coordinated Video-to-Audio Generation with Masked Audio-Visual Alignment and Dynamic Conditional Flows
Enhancing Online Continual Learning with Plug-and-Play State Space Model and Class-Conditional Mixture of Discretization
Effective Cloud Removal for Remote Sensing Images by an Improved Mean-Reverting Denoising Model with Elucidated Design Space
HyperGLM: HyperGraph for Video Scene Graph Generation and Anticipation
MMTL-UniAD: A Unified Framework for Multimodal and Multi-Task Learning in Assistive Driving Perception
S4-Driver: Scalable Self-Supervised Driving Multimodal Large Language Model with Spatio-Temporal Visual Representation
The Photographer's Eye: Teaching Multimodal Large Language Models to See, and Critique Like Photographers
Relative Pose Estimation through Affine Corrections of Monocular Depth Priors
DriveGEN: Generalized and Robust 3D Detection in Driving via Controllable Text-to-Image Diffusion Generation
pFedMxF: Personalized Federated Class-Incremental Learning with Mixture of Frequency Aggregation
BiLoRA: Almost-Orthogonal Parameter Spaces for Continual Learning
Sharp-It: A Multi-view to Multi-view Diffusion Model for 3D Synthesis and Manipulation
Structure-from-Motion with a Non-Parametric Camera Model
MTADiffusion: Mask Text Alignment Diffusion Model for Object Inpainting
Animate and Sound an Image
Towards Enhanced Image Inpainting: Mitigating Unwanted Object Insertion and Preserving Color Consistency
ReasonGrounder: LVLM-Guided Hierarchical Feature Splatting for Open-Vocabulary 3D Visual Grounding and Reasoning
MAC-Ego3D: Multi-Agent Gaussian Consensus for Real-Time Collaborative Ego-Motion and Photorealistic 3D Reconstruction
BrepGiff: Lightweight Generation of Complex B-rep with 3D GAT Diffusion
Split Adaptation for Pre-trained Vision Transformers
Estimating Body and Hand Motion in an Ego-sensed World
HiMoR: Monocular Deformable Gaussian Reconstruction with Hierarchical Motion Representation
BiomedCoOp: Learning to Prompt for Biomedical Vision-Language Models
EgoPressure: A Dataset for Hand Pressure and Pose Estimation in Egocentric Vision
Spatial-Temporal Graph Diffusion Policy with Kinematic Modeling for Bimanual Robotic Manipulation
De^2Gaze: Deformable and Decoupled Representation Learning for 3D Gaze Estimation
Exploring Simple Open-Vocabulary Semantic Segmentation
FactCheXcker: Mitigating Measurement Hallucinations in Chest X-ray Report Generation Models
POSTA: A Go-to Framework for Customized Artistic Poster Generation
SLAM3R: Real-Time Dense Scene Reconstruction from Monocular RGB Videos
TFCustom: Customized Image Generation with Time-Aware Frequency Feature Guidance
RePerformer: Immersive Human-centric Volumetric Videos from Playback to Photoreal Reperformance
Robust Multimodal Survival Prediction with Conditional Latent Differentiation Variational AutoEncoder
Convex Relaxation for Robust Vanishing Point Estimation in Manhattan World
DI-PCG: Diffusion-based Efficient Inverse Procedural Content Generation for High-quality 3D Asset Creation
From Zero to Detail: Deconstructing Ultra-High-Definition Image Restoration from Progressive Spectral Perspective
UniVAD: A Training-free Unified Model for Few-shot Visual Anomaly Detection
Mimic In-Context Learning for Multimodal Tasks
ROD-MLLM: Towards More Reliable Object Detection in Multimodal Large Language Models
LSceneLLM: Enhancing Large 3D Scene Understanding Using Adaptive Visual Preferences
Towards Satellite Image Road Graph Extraction: A Global-Scale Dataset and A Novel Method
Towards Cost-Effective Learning: A Synergy of Semi-Supervised and Active Learning
DynamicScaler: Seamless and Scalable Video Generation for Panoramic Scenes
Calibrated Multi-Preference Optimization for Aligning Diffusion Models
MotiF: Making Text Count in Image Animation with Motion Focal Loss
ConMo: Controllable Motion Disentanglement and Recomposition for Zero-Shot Motion Transfer
Zero-Shot Monocular Scene Flow Estimation in the Wild
VideoGigaGAN: Towards Detail-rich Video Super-Resolution
Go-with-the-Flow: Motion-Controllable Video Diffusion Models Using Real-Time Warped Noise
MeshGen: Generating PBR Textured Mesh with Render-Enhanced Auto-Encoder and Generative Data Augmentation
Learning Visual Composition through Improved Semantic Guidance
Dynamic Motion Blending for Versatile Motion Editing
EVolSplat: Efficient Volume-based Gaussian Splatting for Urban View Synthesis
High Dynamic Range Video Compression: A Large-Scale Benchmark Dataset and A Learned Bit-depth Scalable Compression Algorithm
FlipSketch: Flipping Static Drawings to Text-Guided Sketch Animations
Mani-GS: Gaussian Splatting Manipulation with Triangular Mesh
Adaptive Parameter Selection for Tuning Vision-Language Models
EchoMatch: Partial-to-Partial Shape Matching via Correspondence Reflection
DL2G: Degradation-guided Local-to-Global Restoration for Eyeglass Reflection Removal
Shape My Moves: Text-Driven Shape-Aware Synthesis of Human Motions
UniGraspTransformer: Simplified Policy Distillation for Scalable Dexterous Robotic Grasping
SAT-HMR: Real-Time Multi-Person 3D Mesh Estimation via Scale-Adaptive Tokens
FreeCloth: Free-form Generation Enhances Challenging Clothed Human Modeling
InteractAnything: Zero-shot Human Object Interaction Synthesis via LLM Feedback and Object Affordance Parsing
OverLoCK: An Overview-first-Look-Closely-next ConvNet with Context-Mixing Dynamic Kernels
SegMAN: Omni-scale Context Modeling with State Space Models and Local Attention for Semantic Segmentation
BOE-ViT: Boosting Orientation Estimation with Equivariance in Self-Supervised 3D Subtomogram Alignment
RestorGS: Depth-aware Gaussian Splatting for Efficient 3D Scene Restoration
SF2T: Self-supervised Fragment Finetuning of Video-LLMs for Fine-Grained Understanding
TCFG: Tangential Damping Classifier-free Guidance
Traversing Distortion-Perception Tradeoff using a Single Score-Based Generative Model
LPOSS: Label Propagation Over Patches and Pixels for Open-vocabulary Semantic Segmentation
FLAVC: Learned Video Compression with Feature Level Attention
UNIC-Adapter: Unified Image-instruction Adapter with Multi-modal Transformer for Image Generation
VoteFlow: Enforcing Local Rigidity in Self-Supervised Scene Flow
ManipTrans: Efficient Dexterous Bimanual Manipulation Transfer via Residual Learning
LOD-GS: Achieving Levels of Detail using Scalable Gaussian Soup
Embodied Scene Understanding for Vision Language Models via MetaVQA
HiRes-LLaVA: Restoring Fragmentation Input in High-Resolution Large Vision-Language Models
Video-Panda: Parameter-efficient Alignment for Encoder-free Video-Language Models
Generalizing Deepfake Video Detection with Plug-and-Play: Video-Level Blending and Spatiotemporal Adapter Tuning
V-Stylist: Video Stylization via Collaboration and Reflection of MLLM Agents
AniDoc: Animation Creation Made Easier
Remote Photoplethysmography in Real-World and Extreme Lighting Scenarios
SALOVA: Segment-Augmented Long Video Assistant for Targeted Retrieval and Routing in Long-Form Video Analysis
VLsI: Verbalized Layers-to-Interactions from Large to Small Vision Language Models
Advancing Myopia To Holism: Fully Contrastive Language-Image Pre-training
Rethinking Diffusion for Text-Driven Human Motion Generation: Redundant Representations, Evaluation, and Masked Autoregression
UNICL-SAM: Uncertainty-Driven In-Context Segmentation with Part Prototype Discovery
Efficient Decoupled Feature 3D Gaussian Splatting via Hierarchical Compression
Dual-Granularity Semantic Guided Sparse Routing Diffusion Model for General Pansharpening
Low-Biased General Annotated Dataset Generation
StdGEN: Semantic-Decomposed 3D Character Generation from Single Images
Make It Count: Text-to-Image Generation with an Accurate Number of Objects
STPro: Spatial and Temporal Progressive Learning for Weakly Supervised Spatio-Temporal Grounding
HierarQ: Task-Aware Hierarchical Q-Former for Enhanced Video Understanding
Latent Space Imaging
MUSt3R: Multi-view Network for Stereo 3D Reconstruction
Towards Natural Language-Based Document Image Retrieval: New Dataset and Benchmark
Articulated Kinematics Distillation from Video Diffusion Models
Knowledge Memorization and Rumination for Pre-trained Model-based Class-Incremental Learning
Temporal Action Detection Model Compression by Progressive Block Drop
ControlFace: Harnessing Facial Parametric Control for Face Rigging
FinePhys: Fine-grained Human Action Generation by Explicitly Incorporating Physical Laws for Effective Skeletal Guidance
VEU-Bench: Towards Comprehensive Understanding of Video Editing
TIMotion: Temporal and Interactive Framework for Efficient Human-Human Motion Generation
Design2GarmentCode: Turning Design Concepts to Tangible Garments Through Program Synthesis
M^3-VOS: Multi-Phase, Multi-Transition, and Multi-Scenery Video Object Segmentation
SATA: Spatial Autocorrelation Token Analysis for Enhancing the Robustness of Vision Transformers
An End-to-End Robust Point Cloud Semantic Segmentation Network with Single-Step Conditional Diffusion Models
GRAE-3DMOT: Geometry Relation-Aware Encoder for Online 3D Multi-Object Tracking
SOAP: Vision-Centric 3D Semantic Scene Completion with Scene-Adaptive Decoder and Occluded Region-Aware View Projection
Vision-Language Gradient Descent-driven All-in-One Deep Unfolding Networks
Binarized Mamba-Transformer for Lightweight Quad Bayer HybridEVS Demosaicing
CASP: Compression of Large Multimodal Models Based on Attention Sparsity
TriTex: Learning Texture from a Single Mesh via Triplane Semantic Features
Transfer Your Perspective: Controllable 3D Generation from Any Viewpoint in a Driving Scene
Vision-Language Models Do Not Understand Negation
Two is Better than One: Efficient Ensemble Defense for Robust and Compact Models
Generating 6DoF Object Manipulation Trajectories from Action Description in Egocentric Vision
Leveraging Temporal Cues for Semi-Supervised Multi-View 3D Object Detection
HyperPose: Hypernetwork-Infused Camera Pose Localization and an Extended Cambridge Landmarks Dataset
VladVA: Discriminative Fine-tuning of LVLMs
RENO: Real-Time Neural Compression for 3D LiDAR Point Clouds
FreeUV: Ground-Truth-Free Realistic Facial UV Texture Recovery via Cross-Assembly Inference Strategy
CoMapGS: Covisibility Map-based Gaussian Splatting for Sparse Novel View Synthesis
Recovering Dynamic 3D Sketches from Videos
SoMA: Singular Value Decomposed Minor Components Adaptation for Domain Generalizable Representation Learning
PIDLoc: Cross-View Pose Optimization Network Inspired by PID Controllers
5%>100%: Breaking Performance Shackles of Full Fine-Tuning on Visual Recognition Tasks
Automatic Spectral Calibration of Hyperspectral Images: Method, Dataset and Benchmark
Robust 3D Shape Reconstruction in Zero-Shot from a Single Image in the Wild
DiffLocks: Generating 3D Hair from a Single Image using Diffusion Models
COB-GS: Clear Object Boundaries in 3DGS Segmentation Based on Boundary-Adaptive Gaussian Splitting
CLIP-driven Coarse-to-fine Semantic Guidance for Fine-grained Open-set Semi-supervised Learning
DeRS: Towards Extremely Efficient Upcycled Mixture-of-Experts Models
RoboTwin: Dual-Arm Robot Benchmark with Generative Digital Twins
PI-HMR: Towards Robust In-bed Temporal Human Shape Reconstruction with Contact Pressure Sensing
Chain of Attack: On the Robustness of Vision-Language Models Against Transfer-Based Adversarial Attacks
Hiding Images in Diffusion Models by Editing Learned Score Functions
Mind the Gap: Confidence Discrepancy Can Guide Federated Semi-Supervised Learning Across Pseudo-Mismatch
BlenderGym: Benchmarking Foundational Model Systems for Graphics Editing
DefMamba: Deformable Visual State Space Model
Relation-Rich Visual Document Generator for Visual Information Extraction
Post-pre-training for Modality Alignment in Vision-Language Foundation Models
Preserve or Modify? Context-Aware Evaluation for Balancing Preservation and Modification in Text-Guided Image Editing
RivuletMLP: An MLP-based Architecture for Efficient Compressed Video Quality Enhancement
NitroFusion: High-Fidelity Single-Step Diffusion through Dynamic Adversarial Training
Dense Dispersed Structured Light for Hyperspectral 3D Imaging of Dynamic Scenes
Through-The-Mask: Mask-based Motion Trajectories for Image-to-Video Generation
SPARS3R: Semantic Prior Alignment and Regularization for Sparse 3D Reconstruction
ShapeWords: Guiding Text-to-Image Synthesis with 3D Shape-Aware Prompts
Enhancing Facial Privacy Protection via Weakening Diffusion Purification
ABC-Former: Auxiliary Bimodal Cross-domain Transformer with Interactive Channel Attention for White Balance
PSHuman: Photorealistic Single-image 3D Human Reconstruction using Cross-Scale Multiview Diffusion and Explicit Remeshing
Antidote: A Unified Framework for Mitigating LVLM Hallucinations in Counterfactual Presupposition and Object Perception
ComRoPE: Scalable and Robust Rotary Position Embedding Parameterized by Trainable Commuting Angle Matrices
Reversing Flow for Image Restoration
Knowledge-Aligned Counterfactual-Enhancement Diffusion Perception for Unsupervised Cross-Domain Visual Emotion Recognition
Consistent and Controllable Image Animation with Motion Diffusion Models
CAV-MAE Sync: Improving Contrastive Audio-Visual Mask Autoencoders via Fine-Grained Alignment
MoST: Efficient Monarch Sparse Tuning for 3D Representation Learning
Q-DiT: Accurate Post-Training Quantization for Diffusion Transformers
Towards Explicit Geometry-Reflectance Collaboration for Generalized LiDAR Segmentation in Adverse Weather
Where the Devil Hides: Deepfake Detectors Can No Longer Be Trusted
Fancy123: One Image to High-Quality 3D Mesh Generation via Plug-and-Play Deformation
SASep: Saliency-Aware Structured Separation of Geometry and Feature for Open Set Learning on Point Clouds
OCRT: Boosting Foundation Models in the Open World with Object-Concept-Relation Triad
ATA: Adaptive Transformation Agent for Text-Guided Subject-Position Variable Background Inpainting
TokenMotion: Decoupled Motion Control via Token Disentanglement for Human-centric Video Generation
Neural Motion Simulator Pushing the Limit of World Models in Reinforcement Learning
From Words to Structured Visuals: A Benchmark and Framework for Text-to-Diagram Generation and Editing
SIR-DIFF: Sparse Image Sets Restoration with Multi-View Diffusion Model
Detecting Adversarial Data Using Perturbation Forgery
Complementary Advantages: Exploiting Cross-Field Frequency Correlation for NIR-Assisted Image Denoising
One is Plenty: A Polymorphic Feature Interpreter for Immutable Heterogeneous Collaborative Perception
VolFormer: Explore More Comprehensive Cube Interaction for Hyperspectral Image Restoration and Beyond
ClimbingCap: Multi-Modal Dataset and Method for Rock Climbing in World Coordinate
LaVin-DiT: Large Vision Diffusion Transformer
HERA: Hybrid Explicit Representation for Ultra-Realistic Head Avatars
Towards Effective and Sparse Adversarial Attack on Spiking Neural Networks via Breaking Invisible Surrogate Gradients
Learning Physics-Based Full-Body Human Reaching and Grasping from Brief Walking References
FluidNexus: 3D Fluid Reconstruction and Prediction from a Single Video
FlexGS: Train Once, Deploy Everywhere with Many-in-One Flexible 3D Gaussian Splatting
Sim-to-Real Causal Transfer: A Metric Learning Approach to Causally-Aware Interaction Representations
4D-Fly: Fast 4D Reconstruction from a Single Monocular Video
A Unified Framework for Heterogeneous Semi-supervised Learning
CheXWorld: Exploring Image World Modeling for Radiograph Representation Learning
EchoWorld: Learning Motion-Aware World Models for Echocardiography Probe Guidance
Scene-agnostic Pose Regression for Visual Localization
Enhancing Testing-Time Robustness for Trusted Multi-View Classification in the Wild
Advancing Multiple Instance Learning with Continual Learning for Whole Slide Imaging
Accelerating Diffusion Transformer via Increment-Calibrated Caching with Channel-Aware Singular Value Decomposition
EventGPT: Event Stream Understanding with Multimodal Large Language Models
DroneSplat: 3D Gaussian Splatting for Robust 3D Reconstruction from In-the-Wild Drone Imagery
TopV: Compatible Token Pruning with Inference Time Optimization for Fast and Low-Memory Multimodal Vision Language Model
DynFocus: Dynamic Cooperative Network Empowers LLMs with Video Understanding
Diffusion-based Realistic Listening Head Generation via Hybrid Motion Modeling
Hierarchical Compact Clustering Attention (COCA) for Unsupervised Object-Centric Learning
MEAT: Multiview Diffusion Model for Human Generation on Megapixels with Mesh Attention
Spiking Transformer with Spatial-Temporal Attention
Multi-Modal Synergistic Implicit Image Enhancement for Efficient Optical Flow Estimation
Pay Attention to the Foreground in Object-Centric Learning
The Devil is in Low-Level Features for Cross-Domain Few-Shot Segmentation
FruitNinja: 3D Object Interior Texture Generation with Gaussian Splatting
Language-Guided Salient Object Ranking
VODiff: Controlling Object Visibility Order in Text-to-Image Generation
Tartan IMU: A Light Foundation Model for Inertial Positioning in Robotics
Yo’Chameleon: Personalized Vision and Language Generation
GO-N3RDet: Geometry Optimized NeRF-enhanced 3D Object Detector
FreePCA: Integrating Consistency Information across Long-short Frames in Training-free Long Video Generation via Principal Component Analysis
METASCENES: Towards Automated Replica Creation for Real-world 3D Scans
Automated Generation of Challenging Multiple-Choice Questions for Vision Language Model Evaluation
Multirate Neural Image Compression with Adaptive Lattice Vector Quantization
BIOMEDICA: An Open Biomedical Image-Caption Archive, Dataset, and Vision-Language Models Derived from Scientific Literature
Flowing from Words to Pixels: A Noise-Free Framework for Cross-Modality Evolution
Spatial457: A Diagnostic Benchmark for 6D Spatial Reasoning of Large Multimodal Models
UNIALIGN: Scaling Multimodal Alignment within One Unified Model
VITED: Video Temporal Evidence Distillation
The Devil is in the Prompts: Retrieval-Augmented Prompt Optimization for Text-to-Video Generation
DOF-GS: Adjustable Depth-of-Field 3D Gaussian Splatting for Post-Capture Refocusing, Defocus Rendering and Blur Removal
One-shot 3D Object Canonicalization based on Geometric and Semantic Consistency
BG-Triangle: Bézier Gaussian Triangle for 3D Vectorization and Rendering
SeqAfford: Sequential 3D Affordance Reasoning via Multimodal Large Language Model
GROVE: A Generalized Reward for Learning Open-Vocabulary Physical Skill
ArtiFade: Learning to Generate High-quality Subject from Blemished Images
VisionZip: Longer is Better but Not Necessary in Vision Language Models
GA3CE: Unconstrained 3D Gaze Estimation with Gaze-Aware 3D Context Encoding
RC-AutoCalib: An End-to-End Radar-Camera Automatic Calibration Network
H-MoRe: Learning Human-centric Motion Representation for Action Analysis
Reasoning Mamba: Hypergraph-Guided Region Relation Calculating for Weakly Supervised Affordance Grounding
Depth Any Camera: Zero-Shot Metric Depth Estimation from Any Camera
Feat2GS: Probing Visual Foundation Models with Gaussian Splatting
DnLUT: Ultra-Efficient Color Image Denoising via Channel-Aware Lookup Tables
PassionSR: Post-Training Quantization with Adaptive Scale in One-Step Diffusion based Image Super-Resolution
Proximal Algorithm Unrolling: Flexible and Efficient Reconstruction Networks for Single-Pixel Imaging
FrugalNeRF: Fast Convergence for Extreme Few-shot Novel View Synthesis without Learned Priors
GCC: Generative Color Constancy via Diffusing a Color Checker
SpectroMotion: Dynamic 3D Reconstruction of Specular Scenes
BASKET: A Large-Scale Video Dataset for Fine-Grained Skill Estimation
AuraFusion360: Augmented Unseen Region Alignment for Reference-based 360° Unbounded Scene Inpainting
GIFStream: 4D Gaussian-based Immersive Video with Feature Stream
WISE: A Framework for Gigapixel Whole-Slide-Image Lossless Compression
RGBAvatar: Reduced Gaussian Blendshapes for Online Modeling of Head Avatars
Visual-Instructed Degradation Diffusion for All-in-One Image Restoration
Ref-GS: Directional Factorization for 2D Gaussian Splatting
Horizon-GS: Unified 3D Gaussian Splatting for Large-Scale Aerial-to-Ground Scenes
AnyMoLe: Any Character Motion In-betweening Leveraging Video Diffusion Models
H2ST: Hierarchical Two-Sample Tests for Continual Out-of-Distribution Detection
FFaceNeRF: Few-shot Face Editing in Neural Radiance Fields
Task-Specific Gradient Adaptation for Few-Shot One-Class Classification
Attribute-Missing Multi-view Graph Clustering
3D-Mem: 3D Scene Memory for Embodied Exploration and Reasoning
HuPerFlow: A Comprehensive Benchmark for Human vs. Machine Motion Estimation Comparison
UniK3D: Universal Camera Monocular 3D Estimation
Illumination Spectrum Estimation for Multispectral Images via Surface Reflectance Modeling and Spatial-Spectral Feature Generation
Solving Instance Detection from an Open-World Perspective
InPO: Inversion Preference Optimization with Reparametrized DDIM for Efficient Diffusion Model Alignment
STEPS: Sequential Probability Tensor Estimation for Text-to-Image Hard Prompt Search
SparseAlign: a Fully Sparse Framework for Cooperative Object Detection
Unveiling Visual Perception in Language Models: An Attention Head Analysis Approach
PosterMaker: Towards High-Quality Product Poster Generation with Accurate Text Rendering
Event Ellipsometer: Event-based Mueller-Matrix Video Imaging
Olympus: A Universal Task Router for Computer Vision Tasks
GPVK-VL: Geometry-Preserving Virtual Keyframes for Visual Localization under Large Viewpoint Changes
Identity-Preserving Text-to-Video Generation by Frequency Decomposition
Hearing Hands: Generating Sounds from Physical Interactions in 3D Scenes
Magma: A Foundation Model for Multimodal AI Agents
Zero-Shot 4D Lidar Panoptic Segmentation
Eval3D: Interpretable and Fine-grained Evaluation for 3D Generation
LidarGait++: Learning Local Features and Size Awareness from LiDAR Point Clouds for 3D Gait Recognition
NeISF++: Neural Incident Stokes Field for Polarized Inverse Rendering of Conductors and Dielectrics
PromptHMR: Promptable Human Mesh Recovery
UMotion: Uncertainty-driven Human Motion Estimation from Inertial and Ultra-wideband Units
Classifier-Free Guidance Inside the Attraction Basin May Cause Memorization
Efficient Video Face Enhancement with Enhanced Spatial-Temporal Consistency
Probabilistic Prompt Distribution Learning for Animal Pose Estimation
RLAIF-V: Open-Source AI Feedback Leads to Super GPT-4V Trustworthiness
RoomPainter: View-Integrated Diffusion for Consistent Indoor Scene Texturing
MBQ: Modality-Balanced Quantization for Large Vision-Language Models
VTON 360: High-Fidelity Virtual Try-On from Any Viewing Direction
Associative Transformer
Adaptive Part Learning for Fine-Grained Generalized Category Discovery: A Plug-and-Play Enhancement
Implicit Bias Injection Attacks against Text-to-Image Diffusion Models
D^3: Scaling Up Deepfake Detection by Learning from Discrepancy
Revisiting Source-Free Domain Adaptation: Insights into Representativeness, Generalization, and Variety
MoFlow: One-Step Flow Matching for Human Trajectory Forecasting via Implicit Maximum Likelihood Estimation based Distillation
EEE-Bench: A Comprehensive Multimodal Electrical And Electronics Engineering Benchmark
FedMIA: An Effective Membership Inference Attack Exploiting "All for One" Principle in Federated Learning
A Physics-Informed Blur Learning Framework for Imaging Systems
Segment Any-Quality Images with Generative Latent Space Enhancement
Theoretical Insights in Model Inversion Robustness and Conditional Entropy Maximization for Collaborative Inference Systems
UA-Pose: Uncertainty-Aware 6D Object Pose Estimation and Online Object Completion with Partial References
g3D-LF: Generalizable 3D-Language Feature Fields for Embodied Tasks
DeNVeR: Deformable Neural Vessel Representations for Unsupervised Video Vessel Segmentation
Inference-Scale Complexity in ANN-SNN Conversion for High-Performance and Low-Power Applications
Self-Supervised Learning for Color Spike Camera Reconstruction
USP-Gaussian: Unifying Spike-based Image Reconstruction, Pose Correction and Gaussian Splatting
Sparse2DGS: Geometry-Prioritized Gaussian Splatting for Surface Reconstruction from Sparse Views
GoalFlow: Goal-Driven Flow Matching for Multimodal Trajectories Generation in End-to-End Autonomous Driving
Molmo and PixMo: Open Weights and Open Data for State-of-the-Art Vision-Language Models
ATP: Adaptive Threshold Pruning for Efficient Data Encoding in Quantum Neural Networks
Can Text-to-Video Generation help Video-Language Alignment?
PerLA: Perceptive 3D Language Assistant
Sparse Point Cloud Patches Rendering via Splitting 2D Gaussians
SGCR: Spherical Gaussians for Efficient 3D Curve Reconstruction
EdgeMovingNet: Edge-preserving Point Cloud Reconstruction via Joint Geometry Features
CoE: Chain-of-Explanation via Automatic Visual Concept Circuit Description and Polysemanticity Quantification
VideoICL: Confidence-based Iterative In-context Learning for Out-of-Distribution Video Understanding
On-Device Self-Supervised Learning of Low-Latency Monocular Depth from Only Events
T2V-CompBench: A Comprehensive Benchmark for Compositional Text-to-video Generation
VoCo-LLaMA: Towards Vision Compression with Large Language Models
ATP-LLaVA: Adaptive Token Pruning for Large Vision Language Models
Jailbreaking the Non-Transferable Barrier via Test-Time Data Disguising
Omnia de EgoTempo: Benchmarking Temporal Understanding of Multi-Modal LLMs in Egocentric Videos
LOGICZSL: Exploring Logic-induced Representation for Compositional Zero-shot Learning
Active Data Curation Effectively Distills Large-Scale Multimodal Models
Model Poisoning Attacks to Federated Learning via Multi-Round Consistency
Attraction Diminishing and Distributing for Few-Shot Class-Incremental Learning
EdgeTAM: On-Device Track Anything Model
Distilled Prompt Learning for Incomplete Multimodal Survival Prediction
InterMimic: Towards Universal Whole-Body Control for Physics-Based Human-Object Interactions
Floating No More: Object-Ground Reconstruction from a Single Image
InterAct: Advancing Large-Scale Versatile 3D Human-Object Interaction Generation
BARD-GS: Blur-Aware Reconstruction of Dynamic Scenes via Gaussian Splatting
Enhancing Dance-to-Music Generation via Negative Conditioning Latent Diffusion Model
Benchmarking Object Detectors under Real-World Distribution Shifts in Satellite Imagery
Prometheus: 3D-Aware Latent Diffusion Models for Feed-Forward Text-to-3D Scene Generation
Learning Temporally Consistent Video Depth from Video Diffusion Priors
GG-SSMs: Graph-Generating State Space Models
Divot: Diffusion Powers Video Tokenizer for Comprehension and Generation
A Simple Data Augmentation for Feature Distribution Skewed Federated Learning
Meta-Learning Hyperparameters for Parameter Efficient Fine-Tuning
Adv-CPG: A Customized Portrait Generation Framework with Facial Adversarial Attacks
Video-Bench: Human-Aligned Video Generation Benchmark
Symbolic Representation for Any-to-Any Generative Tasks
Visual Agentic AI for Spatial Reasoning with a Dynamic API
Self-Evolving Visual Concept Library using Vision-Language Critics
GenVDM: Generating Vector Displacement Maps From a Single Image
Enhanced Visual-Semantic Interaction with Tailored Prompts for Pedestrian Attribute Recognition
URWKV: Unified RWKV Model with Multi-state Perspective for Low-light Image Restoration
Seeing A 3D World in A Grain of Sand
DPU: Dynamic Prototype Updating for Multimodal Out-of-Distribution Detection
NTClick: Achieving Precise Interactive Segmentation With Noise-tolerant Clicks
One-Minute Video Generation with Test-Time Training
Channel Consistency Prior and Self-Reconstruction Strategy Based Unsupervised Image Deraining
GroomLight: Hybrid Inverse Rendering for Relightable Human Hair Appearance Modeling
AIpparel: A Multimodal Foundation Model for Digital Garments
Track Any Anomalous Object: A Granular Video Anomaly Detection Pipeline
STEP: Enhancing Video-LLMs’ Compositional Reasoning by Spatio-Temporal Graph-guided Self-Training
Enhancing SAM with Efficient Prompting and Preference Optimization for Semi-supervised Medical Image Segmentation
DKDM: Data-Free Knowledge Distillation for Diffusion Models with Any Architecture
A Closer Look at Time Steps is Worthy of Triple Speed-Up for Diffusion Model Training
Diffusion Self-Distillation for Zero-Shot Customized Image Generation
EmotiveTalk: Expressive Talking Head Generation through Audio Information Decoupling and Emotional Video Diffusion
Birth and Death of a Rose
InsTaG: Learning Personalized 3D Talking Head from Few-Second Video
Online Video Understanding: OVBench and VideoChat-Online
Prototype-Based Image Prompting for Weakly Supervised Histopathological Image Segmentation
Interpretable Image Classification via Non-parametric Part Prototype Learning
Soft Self-labeling and Potts Relaxations for Weakly-supervised Segmentation
MicroVQA: A Multimodal Reasoning Benchmark for Microscopy-Based Scientific Research
MP-SfM: Monocular Surface Priors for Robust Structure-from-Motion
Flash3D: Super-scaling Point Transformers through Joint Hardware-Geometry Locality
ILIAS: Instance-Level Image retrieval At Scale
Re-thinking Temporal Search for Long-Form Video Understanding
3DGUT: Enabling Distorted Cameras and Secondary Rays in Gaussian Splatting
Dispider: Enabling Video LLMs with Active Real-Time Interaction via Disentangled Perception, Decision, and Reaction
Probing the Mid-level Vision Capabilities of Self-Supervised Learning
Conical Visual Concentration for Efficient Large Vision-Language Models
OVO-Bench: How Far is Your Video-LLMs from Real-World Online Video Understanding?
MultimodalStudio: A Heterogeneous Sensor Dataset and Framework for Neural Rendering across Multiple Imaging Modalities
MANTA: Diffusion Mamba for Efficient and Effective Stochastic Long-Term Dense Action Anticipation
SyncVP: Joint Diffusion for Synchronous Multi-Modal Video Prediction
CroCoDL: Cross-device Collaborative Dataset for Localization
HiFi-Portrait: Zero-shot Identity-preserved Portrait Generation with High-fidelity Multi-face Fusion
UIBDiffusion: Universal Imperceptible Backdoor Attack for Diffusion Models
SeqMvRL: A Sequential Fusion Framework for Multi-view Representation Learning
A4A: Adapter for Adapter Transfer via All-for-All Mapping for Cross-Architecture Models
Dragin3D: Image Editing by Dragging in 3D Space
SVLTA: Benchmarking Vision-Language Temporal Alignment via Synthetic Video Situation
FeedEdit: Text-Based Image Editing with Dynamic Feedback Regulation
D^2iT: Dynamic Diffusion Transformer for Accurate Image Generation
Are Images Indistinguishable to Humans Also Indistinguishable to Classifiers?
Movie Weaver: Tuning-Free Multi-Concept Video Personalization with Anchored Prompts
Beyond Human Perception: Understanding Multi-Object World from Monocular View
VELOCITI: Benchmarking Video-Language Compositional Reasoning with Strict Entailment
Scaling Properties of Diffusion Models For Perceptual Tasks
Narrating the Video: Boosting Text-Video Retrieval via Comprehensive Utilization of Frame-Level Captions
SceneCrafter: Controllable Multi-View Driving Scene Editing
ZeroVO: Visual Odometry with Minimal Assumptions
GPAvatar: High-fidelity Head Avatars by Learning Efficient Gaussian Projections
Parameter-efficient Fine-tuning in Hyperspherical Space for Open-vocabulary Semantic Segmentation
Understanding Fine-tuning CLIP for Open-vocabulary Semantic Segmentation in Hyperbolic Space
Star with Bilinear Mapping
Domain Generalization in CLIP via Learning with Diverse Text Prompts
Retaining Knowledge and Enhancing Long-Text Representations in CLIP through Dual-Teacher Distillation
Visual Consensus Prompting for Co-Salient Object Detection
Beyond Sight: Towards Cognitive Alignment in LVLM via Enriched Visual Knowledge
Learning Occlusion-Robust Vision Transformers for Real-Time UAV Tracking
MambaIC: State Space Models for High-Performance Learned Image Compression
ChainHOI: Joint-based Kinematic Chain Modeling for Human-Object Interaction Generation
MOVIS: Enhancing Multi-Object Novel View Synthesis for Indoor Scenes
Temporal Score Analysis for Understanding and Correcting Diffusion Artifacts
Six-CD: Benchmarking Concept Removals for Text-to-image Diffusion Models
NADER: Neural Architecture Design via Multi-Agent Collaboration
Prior Does Matter: Visual Navigation via Denoising Diffusion Bridge Models
Unlocking Generalization Power in LiDAR Point Cloud Registration
Alias-Free Latent Diffusion Models: Improving Fractional Shift Equivariance of Diffusion Latent Space
MEET: Towards Memory-Efficient Temporal Sparse Deep Neural Networks
Gaussian Splashing: Unified Particles for Versatile Motion Synthesis and Rendering
Population Normalization for Federated Learning
Generating 3D-Consistent Videos from Unposed Internet Photos
Turbo3D: Ultra-fast Text-to-3D Generation
GenAssets: Generating in-the-wild 3D Assets in Latent Space
Context-Aware Multimodal Pretraining
FLAIR: VLM with Fine-grained Language-informed Image Representations
How to Merge Your Multimodal Models Over Time?
COSMOS: Cross-Modality Self-Distillation for Vision Language Pre-training
SAMBLE: Shape-Specific Point Cloud Sampling for an Optimal Trade-Off Between Local Detail and Global Uniformity
Event-based Video Super-Resolution via State Space Models
NoPain: No-box Point Cloud Attack via Optimal Transport Singular Boundary
Automated Proof of Polynomial Inequalities via Reinforcement Learning
Learning-enabled Polynomial Lyapunov Function Synthesis via High-Accuracy Counterexample-Guided Framework
T2ICount: Enhancing Cross-modal Understanding for Zero-Shot Counting
Instant Adversarial Purification with Adversarial Consistency Distillation
Noise-Consistent Siamese-Diffusion for Medical Image Synthesis and Segmentation
DoF-Gaussian: Controllable Depth-of-Field for 3D Gaussian Splatting
CH3Depth: Efficient and Flexible Depth Foundation Model with Flow Matching
TacoDepth: Towards Efficient Radar-Camera Depth Estimation with One-stage Fusion
FOCUS: Knowledge-enhanced Adaptive Visual Compression for Few-shot Whole Slide Image Classification
Balancing Two Classifiers via A Simplex ETF Structure for Model Calibration
AIGV-Assessor: Benchmarking and Evaluating the Perceptual Quality of Text-to-Video Generation with LMM
Shadow Generation Using Diffusion Model with Geometry Prior
FineVQ: Fine-Grained User Generated Content Video Quality Assessment
MExD: An Expert-Infused Diffusion Model for Whole-Slide Image Classification
Rethinking Vision-Language Model in Face Forensics: Multi-Modal Interpretable Forged Face Detector
Learning Phase Distortion with Selective State Space Models for Video Turbulence Mitigation
Generative Photography: Scene-Consistent Camera Control for Realistic Text-to-Image Synthesis
Generalized Few-shot 3D Point Cloud Segmentation with Vision-Language Models
Prompt-CAM: Making Vision Transformers Interpretable for Fine-Grained Analysis
Multi-focal Conditioned Latent Diffusion for Person Image Synthesis
Diff-Palm: Realistic Palmprint Generation with Polynomial Creases and Intra-Class Variation Controllable Diffusion Models
Nested Diffusion Models Using Hierarchical Latent Priors
UrbanCAD: Towards Highly Controllable and Photorealistic 3D Vehicles for Urban Scene Simulation
CADCrafter: Generating Computer-Aided Design Models from Unconstrained Images
IndoorGS: Geometric Cues Guided Gaussian Splatting for Indoor Scene Reconstruction
Bridging the Vision-Brain Gap with an Uncertainty-Aware Blur Prior
HVI: A New Color Space for Low-light Image Enhancement
CADDreamer: CAD Object Generation from Single-view Images
Rethinking the Adversarial Robustness of Multi-Exit Neural Networks in an Attack-Defense Game
MFogHub: Bridging Multi-Regional and Multi-Satellite Data for Global Marine Fog Detection and Forecasting
Localizing Events in Videos with Multimodal Queries
FedBiP: Heterogeneous One-Shot Federated Learning with Personalized Latent Diffusion Models
Rethinking Spiking Self-Attention Mechanism: Implementing α-XNOR Similarity Calculation in Spiking Transformers
GigaHands: A Massive Annotated Dataset of Bimanual Hand Activities
Pursuing Temporal-Consistent Video Virtual Try-On via Dynamic Pose Interaction
EnergyMoGen: Compositional Human Motion Generation with Energy-Based Diffusion Model in Latent Space
SGFormer: Satellite-Ground Fusion for 3D Semantic Scene Completion
MotionPRO: Exploring the Role of Pressure in Human MoCap and Beyond
SVDC: Consistent Direct Time-of-Flight Video Depth Completion with Frequency Selective Fusion
ManiVideo: Generating Hand-Object Manipulation Video with Dexterous and Generalizable Grasping
Interpreting Object-level Foundation Models via Visual Precision Search
Query Efficient Black-Box Visual Prompting with Subspace Learning
Holmes-VAU: Towards Long-term Video Anomaly Understanding at Any Granularity
Few-Shot Recognition via Stage-Wise Retrieval-Augmented Finetuning
Cross-Modal Interactive Perception Network with Mamba for Lung Tumor Segmentation in PET-CT Images
Object-Shot Enhanced Grounding Network for Egocentric Video
Dora: Sampling and Benchmarking for 3D Shape Variational Auto-Encoders
Taste More, Taste Better: Diverse Data and Strong Model Boost Semi-Supervised Crowd Counting
OSMamba: Omnidirectional Spectral Mamba with Dual-Domain Prior Generator for Exposure Correction
Spk2SRImgNet: Super-Resolve Dynamic Scene from Spike Stream via Motion Aligned Collaborative Filtering
InstanceGaussian: Appearance-Semantic Joint Gaussian Representation for 3D Instance-Level Perception
Secret Lies in Color: Enhancing AI-Generated Images Detection with Color Distribution Analysis
SceneTAP: Scene-Coherent Typographic Adversarial Planner against Vision-Language Models in Real-World Environments
AVF-MAE++: Scaling Affective Video Facial Masked Autoencoders via Efficient Audio-Visual Self-Supervised Learning
Empowering LLMs to Understand and Generate Complex Vector Graphics
CARE Transformer: Mobile-Friendly Linear Visual Transformer via Decoupled Dual Interaction
HaWoR: World-Space Hand Motion Reconstruction from Egocentric Videos
WiLoR: End-to-end 3D Hand Localization and Reconstruction in-the-wild
LongVALE: Vision-Audio-Language-Event Benchmark Towards Time-Aware Omni-Modal Perception of Long Videos
RoomTour3D: Geometry-Aware Video-Instruction Tuning for Embodied Navigation
The Language of Motion: Unifying Verbal and Non-verbal Language of 3D Human Motion
ECVC: Exploiting Non-Local Correlations in Multiple Frames for Contextual Video Compression
Mesh Mamba: A Unified State Space Model for Saliency Prediction in Non-Textured and Textured Meshes
FoundHand: Large-Scale Domain-Specific Learning for Controllable Hand Image Generation
HOT3D: Hand and Object Tracking in 3D from Egocentric Multi-View Videos
SKDream: Controllable Multi-view and 3D Generation with Arbitrary Skeletons
Learning Dynamic Collaborative Network for Semi-supervised 3D Vessel Segmentation
Weakly Supervised Contrastive Adversarial Training for Learning Robust Features from Semi-supervised Data
Neural Hierarchical Decomposition for Single Image Plant Modeling
Decouple-Then-Merge: Finetune Diffusion Models as Multi-Task Learning
Dataset Distillation with Neural Characteristic Function: A Minmax Perspective
ProReflow: Progressive Reflow with Decomposed Velocity
Let's Verify and Reinforce Image Generation Step by Step
Deterministic-to-Stochastic Diverse Latent Feature Mapping for Human Motion Synthesis
Diffusion-based Event Generation for High-Quality Image Deblurring
RoGSplat: Learning Robust Generalizable Human Gaussian Splatting from Sparse Multi-View Images
Towards All-in-One Medical Image Re-Identification
Rethinking Correspondence-based Category-Level Object Pose Estimation
Structure-Aware Correspondence Learning for Relative Pose Estimation
Uni-Renderer: Unifying Rendering and Inverse Rendering Via Dual Stream Diffusion
Kiss3DGen: Repurposing Image Diffusion Models for 3D Asset Generation
Scene Splatter: Momentum 3D Scene Generation from Single Image with Video Diffusion Model
Temporal Separation with Entropy Regularization for Knowledge Distillation in Spiking Neural Networks
Lifelong Knowledge Editing for Vision Language Models with Low-Rank Mixture-of-Experts
ICT: Image-Object Cross-Level Trusted Intervention for Mitigating Object Hallucination in Large Vision-Language Models
CLIP is Almost All You Need: Towards Parameter-Efficient Scene Text Retrieval without OCR
Dinomaly: The Less Is More Philosophy in Multi-Class Unsupervised Anomaly Detection
DreamText: High Fidelity Scene Text Synthesis
MonoInstance: Enhancing Monocular Priors via Multi-view Instance Alignment for Neural Rendering and Reconstruction
UMFN: Unified Multi-Domain Face Normalization for Joint Cross-domain Prototype Learning and Heterogeneous Face Recognition
SinGS: Animatable Single-Image Human Gaussian Splats with Kinematic Priors
Graph-Embedded Structure-Aware Perceptual Hashing for Neural Network Protection and Piracy Detection
EVOS: Efficient Implicit Neural Training via EVOlutionary Selector
Taming Teacher Forcing for Masked Autoregressive Video Generation
LookCloser: Frequency-aware Radiance Field for Tiny-Detail Scene
AniMo: Species-Aware Model for Text-Driven Animal Motion Generation
Simulator HC: Regression-based Online Simulation of Starting Problem-Solution Pairs for Homotopy Continuation in Geometric Vision
CoA: Towards Real Image Dehazing via Compression-and-Adaptation
PICD: Versatile Perceptual Image Compression with Diffusion Rendering
TopNet: Transformer-Efficient Occupancy Prediction Network for Octree-Structured Point Cloud Geometry Compression
Linguistics-aware Masked Image Modeling for Self-supervised Scene Text Recognition
Continuous 3D Perception Model with Persistent State
DiscoVLA: Discrepancy Reduction in Vision, Language, and Alignment for Parameter-Efficient Video-Text Retrieval
See Further When Clear: Curriculum Consistency Model
Cross-Rejective Open-Set SAR Image Registration
Unsupervised Continual Domain Shift Learning with Multi-Prototype Modeling
Enhanced then Progressive Fusion with View Graph for Multi-View Clustering
DepthCrafter: Generating Consistent Long Depth Sequences for Open-world Videos
UniPre3D: Unified Pre-training of 3D Point Cloud Models with Cross-Modal Gaussian Splatting
BHViT: Binarized Hybrid Vision Transformer
Knowledge Bridger: Towards Training-Free Missing Modality Completion
GaussianFormer-2: Probabilistic Gaussian Superposition for Efficient 3D Occupancy Prediction
Improving Accuracy and Calibration via Differentiated Deep Mutual Learning
Infinity∞: Scaling Bitwise AutoRegressive Modeling for High-Resolution Image Synthesis
Learning Person-Specific Animatable Face Models from In-the-Wild Images via a Shared Base Model
Style Evolving along Chain-of-Thought for Unknown-Domain Object Detection
ODHSR: Online Dense 3D Reconstruction of Humans and Scenes from Monocular Videos
Touch2Shape: Touch-Conditioned 3D Diffusion for Shape Exploration and Reconstruction
NightAdapter: Learning a Frequency Adapter for Generalizable Night-time Scene Segmentation
Aesthetic Post-Training Diffusion Models from Generic Preferences with Step-by-step Preference Optimization
ActiveGAMER: Active GAussian Mapping through Efficient Rendering
LiDAR-RT: Gaussian-based Ray Tracing for Dynamic LiDAR Re-simulation
BooW-VTON: Boosting In-the-Wild Virtual Try-On via Mask-Free Pseudo Data Training
DexHandDiff: Interaction-aware Diffusion Planning for Adaptive Dexterous Manipulation
DeSiRe-GS: 4D Street Gaussians for Static-Dynamic Decomposition and Surface Reconstruction for Urban Driving Scenes
CompGS: Unleashing 2D Compositionality for Compositional Text-to-3D via Dynamically Optimizing 3D Gaussians
A Data-Centric Revisit of Pre-Trained Vision Models for Robot Learning
Towards Universal Dataset Distillation via Task-Driven Diffusion
TexGaussian: Generating High-quality PBR Material via Octree-based 3D Gaussian Splatting
TexGarment: Consistent Garment UV Texture Generation via Efficient 3D Structure-Guided Diffusion Transformer
Point-to-Region Loss for Semi-Supervised Point-Based Crowd Counting
VideoRefer Suite: Advancing Spatial-Temporal Object Understanding with Video LLM
From Laboratory to Real World: A New Benchmark Towards Privacy-Preserved Visible-Infrared Person Re-Identification
Supervising Sound Localization by In-the-wild Egomotion
PhysGen3D: Crafting a Miniature Interactive World from a Single Image
UniScene: Unified Occupancy-centric Driving Scene Generation
RDD: Robust Feature Detector and Descriptor using Deformable Transformer
Everything to the Synthetic: Diffusion-driven Test-time Adaptation via Synthetic-Domain Alignment
Full-DoF Egomotion Estimation for Event Cameras Using Geometric Solvers
CCIN: Compositional Conflict Identification and Neutralization for Composed Image Retrieval
V2V3D: View-to-View Denoised 3D Reconstruction for Light Field Microscopy
StickMotion: Generating 3D Human Motions by Drawing a Stickman
D^3CTTA: Domain-Dependent Decorrelation for Continual Test-Time Adaption of 3D LiDAR Segmentation
HeMoRa: Unsupervised Heuristic Consensus Sampling for Robust Point Cloud Registration
QuartDepth: Post-Training Quantization for Real-Time Depth Estimation on the Edge
HyperLoRA: Parameter-Efficient Adaptive Generation for Portrait Synthesis
Learning from Neighbors: Category Extrapolation for Long-Tail Learning
MICAS: Multi-grained In-Context Adaptive Sampling for 3D Point Cloud Processing
FlexDrive: Toward Trajectory Flexibility in Driving Scene Gaussian Splatting Reconstruction and Rendering
FreeSim: Toward Free-viewpoint Camera Simulation in Driving Scenes
MambaVO: Deep Visual Odometry Based on Sequential Matching Refinement and Training Smoothing
DualTalk: Dual-Speaker Interaction for 3D Talking Head Conversations
Digital Twin Catalog: A Large-Scale Photorealistic 3D Object Digital Twin Dataset
GS-DiT: Advancing Video Generation with Dynamic 3D Gaussian Fields through Efficient Dense 3D Point Tracking
LIRM: Large Inverse Rendering Model for Progressive Reconstruction of Shape, Materials and View-dependent Radiance Fields
NVComposer: Boosting Generative Novel View Synthesis with Multiple Sparse and Unposed Images
DiTCtrl: Exploring Attention Control in Multi-Modal Diffusion Transformer for Tuning-Free Multi-Prompt Longer Video Generation
Provoking Multi-modal Few-Shot LVLM via Exploration-Exploitation In-Context Learning
OmniManip: Towards General Robotic Manipulation via Object-Centric Interaction Primitives as Spatial Constraints
GraphI2P: Image-to-Point Cloud Registration with Exploring Pattern of Correspondence via Graph Learning
SEEN-DA: SEmantic ENtropy guided Domain-aware Attention for Domain Adaptive Object Detection
Dissecting and Mitigating Diffusion Bias via Mechanistic Interpretability
OmniDocBench: Benchmarking Diverse PDF Document Parsing with Comprehensive Annotations
DyFo: A Training-Free Dynamic Focus Visual Search for Enhancing LMMs in Fine-Grained Visual Understanding
DynRefer: Delving into Region-level Multimodal Tasks via Dynamic Resolution
Timestep Embedding Tells: It's Time to Cache for Video Diffusion Model
Non-Natural Image Understanding with Advancing Frequency-based Vision Encoders
RoboGround: Robotic Manipulation with Grounded Vision-Language Priors
Towards Transformer-Based Aligned Generation with Self-Coherence Guidance
Efficient Motion-Aware Video MLLM
PillarHist: A Quantization-aware Pillar Feature Encoder based on Height-aware Histogram
OFER: Occluded Face Expression Reconstruction
RSAR: Restricted State Angle Resolver and Rotated SAR Benchmark
Visual Persona: Foundation Model for Full-Body Human Customization
Move-in-2D: 2D-Conditioned Human Motion Generation
UHD-processer: Unified UHD Image Restoration with Progressive Frequency Learning and Degradation-aware Prompts
GREAT: Geometry-Intention Collaborative Inference for Open-Vocabulary 3D Object Affordance Grounding
Efficient Test-time Adaptive Object Detection via Sensitivity-Guided Pruning
QMambaBSR: Burst Image Super-Resolution with Query State Space Model
A Lightweight UDF Learning Framework for 3D Reconstruction Based on Local Shape Functions
Decoder Gradient Shield: Provable and High-Fidelity Prevention of Gradient-Based Box-Free Watermark Removal
Synchronized Video-to-Audio Generation via Mel Quantization-Continuum Decomposition
Closed-Loop Supervised Fine-Tuning of Tokenized Traffic Models
All-Optical Nonlinear Diffractive Deep Network for Ultrafast Image Denoising
Learning Affine Correspondences by Integrating Geometric Constraints
STDD: Spatio-Temporal Dual Diffusion for Video Generation
Teller: Real-Time Streaming Audio-Driven Portrait Animation with Autoregressive Motion Generation
K-Sort Arena: Efficient and Reliable Benchmarking for Generative Models via K-wise Human Preferences
Attention Distillation: A Unified Approach to Visual Characteristics Transfer
SMTPD: A New Benchmark for Temporal Prediction of Social Media Popularity
Visual Prompting for One-shot Controllable Video Editing without Inversion
Multi-Sensor Object Anomaly Detection: Unifying Appearance, Geometry, and Internal Properties
Instruction-based Image Manipulation by Watching How Things Move
OSDFace: One-Step Diffusion Model for Face Restoration
Splatter-360: Generalizable 360 Gaussian Splatting for Wide-baseline Panoramic Images
GauSTAR: Gaussian Surface Tracking and Reconstruction
Dyn-HaMR: Recovering 4D Interacting Hand Motion from a Dynamic Camera
Buffer Anytime: Zero-Shot Video Depth and Normal from Image Priors
ADU: Adaptive Detection of Unknown Categories in Black-Box Domain Adaptation
X-Dyna: Expressive Dynamic Human Image Animation
Face Forgery Video Detection via Temporal Forgery Cue Unraveling
Blood Flow Speed Estimation with Optical Coherence Tomography Angiography Images
SIDA: Social Media Image Deepfake Detection, Localization and Explanation with Large Multimodal Model
M3amba: Memory Mamba is All You Need for Whole Slide Image Classification
Unveiling the Ignorance of MLLMs: Seeing Clearly, Answering Incorrectly
SP3D: Boosting Sparsely-Supervised 3D Object Detection via Accurate Cross-Modal Semantic Prompts
FedCALM: Conflict-aware Layer-wise Mitigation for Selective Aggregation in Deeper Personalized Federated Learning
Towards Precise Scaling Laws for Video Diffusion Transformers
Koala-36M: A Large-scale Video Dataset Improving Consistency between Fine-grained Conditions and Video Content
Boosting Adversarial Transferability through Augmentation in Hypothesis Space
Bayesian Prompt Flow Learning for Zero-Shot Anomaly Detection
NeighborRetr: Balancing Hub Centrality in Cross-Modal Retrieval
Robust-MVTON: Learning Cross-Pose Feature Alignment and Fusion for Robust Multi-View Virtual Try-On
Plug-and-Play PPO: An Adaptive Point Prompt Optimizer Making SAM Greater
EnvPoser: Environment-aware Realistic Human Motion Estimation from Sparse Observations with Uncertainty Modeling
Distilling Spatially-Heterogeneous Distortion Perception for Blind Image Quality Assessment
ScaleLSD: Scalable Deep Line Segment Detection Streamlined
LiVOS: Light Video Object Segmentation with Gated Linear Matching
ShowUI: One Vision-Language-Action Model for GUI Visual Agent
LP-Diff: Towards Improved Restoration of Real-World Degraded License Plate
HyperSeg: Hybrid Segmentation Assistant with Fine-grained Visual Perceiver
VideoDPO: Omni-Preference Alignment for Video Diffusion Generation
InstanceCap: Improving Text-to-Video Generation via Instance-aware Structured Caption
Parallelized Autoregressive Visual Generation
VisionPAD: A Vision-Centric Pre-training Paradigm for Autonomous Driving
K-LoRA: Unlocking Training-Free Fusion of Any Subject and Style LoRAs
ChatGarment: Garment Estimation, Generation and Editing via Large Language Models
HEIE: MLLM-Based Hierarchical Explainable AIGC Image Implausibility Evaluator
CO-SPY: Combining Semantic and Pixel Features to Detect Synthetic Images by AI
CarPlanner: Consistent Auto-regressive Trajectory Planning for Large-Scale Reinforcement Learning in Autonomous Driving
StreetCrafter: Street View Synthesis with Controllable Video Diffusion Models
EnvGS: Modeling View-Dependent Appearance with Environment Gaussian
FreeTimeGS: Free Gaussian Primitives at Anytime Anywhere for Dynamic Scene Reconstruction
Towards Explainable and Unprecedented Accuracy in Matching Challenging Finger Crease Patterns
Debiasing Multimodal Large Language Models via Noise-Aware Preference Optimization
DiN: Diffusion Model for Robust Medical VQA with Semantic Noisy Labels
MultiGO: Towards Multi-level Geometry Learning for Monocular 3D Textured Human Reconstruction
Balanced Rate-Distortion Optimization in Learned Image Compression
Interactive Medical Image Analysis with Concept-based Similarity Reasoning
TAGA: Self-supervised Learning for Template-free Animatable Gaussian Articulated Model
Rashomon Sets for Prototypical-Part Networks: Editing Interpretable Models in Real-Time
Layered Image Vectorization via Semantic Simplification
OmniDrive: A Holistic Vision-Language Dataset for Autonomous Driving with Counterfactual Reasoning
Argus: Vision-Centric Reasoning with Grounded Chain-of-Thought
HD-EPIC: A Highly-Detailed Egocentric Video Dataset
Human-centered Interactive Learning via MLLMs for Text-to-Image Person Re-identification
SpecTRe-GS: Modeling Highly Specular Surfaces with Reflected Nearby Objects by Tracing Rays in 3D Gaussian Splatting
DashGaussian: Optimizing 3D Gaussian Splatting in 200 Seconds
IMFine: 3D Inpainting via Geometry-guided Multi-view Refinement
Empowering Large Language Models with 3D Situation Awareness
Classic Video Denoising in a Machine Learning World: Robust, Fast, and Controllable
VASparse: Towards Efficient Visual Hallucination Mitigation via Visual-Aware Token Sparsification
RaSS: Improving Denoising Diffusion Samplers with Reinforced Active Sampling Scheduler
Font-Agent: Enhancing Font Understanding with Large Language Models
OmniGuard: Hybrid Manipulation Localization via Augmented Versatile Deep Image Watermarking
Breaking the Memory Barrier of Contrastive Loss via Tile-Based Strategy
DELT: A Simple Diversity-driven EarlyLate Training for Dataset Distillation
Prior-free 3D Object Tracking
Emphasizing Discriminative Features for Dataset Distillation in Complex Scenarios
Words or Vision: Do Vision-Language Models Have Blind Faith in Text?
AutoPresent: Designing Structured Visuals from Scratch
ID-Patch: Robust ID Association for Group Photo Personalization
COAP: Memory-Efficient Training with Correlation-Aware Gradient Projection
Steepest Descent Density Control for Compact 3D Gaussian Splatting
Feature4X: Bridging Any Monocular Video to 4D Agentic AI with Versatile Gaussian Feature Fields
OpenSDI: Spotting Diffusion-Generated Images in the Open World
Vid2Sim: Generalizable, Video-based Reconstruction of Appearance, Geometry and Physics for Mesh-free Simulation
Align3R: Aligned Monocular Depth Estimation for Dynamic Videos
ScaMo: Exploring the Scaling Law in Autoregressive Motion Generation Model
Teaching Large Language Models to Regress Accurate Image Quality Scores Using Score Distribution
UltraFusion: Ultra High Dynamic Imaging using Exposure Fusion
BIP3D: Bridging 2D Images and 3D Perception for Embodied Intelligence
GaussTR: Foundation Model-Aligned Gaussian Transformer for Self-Supervised 3D Spatial Understanding
PanoGS: Gaussian-based Panoptic Segmentation for 3D Open Vocabulary Scene Understanding
Lessons and Insights from a Unifying Study of Parameter-Efficient Fine-Tuning (PEFT) in Visual Recognition
Hierarchical Adaptive Filtering Network for Text Image Specular Highlight Removal
FlexUOD: The Answer to Real-world Unsupervised Image Outlier Detection
IDOL: Instant Photorealistic 3D Human Creation from a Single Image
Dense-To-Sparse Video Diffusion For High-fidelity Multi-View Images Synthesis
Rethinking Personalized Aesthetics Assessment: Employing Physique Aesthetics Assessment as An Exemplification
FSFM: A Generalizable Face Security Foundation Model via Self-Supervised Facial Representation Learning
Harnessing Frequency Spectrum Insights for Image Copyright Protection Against Diffusion Models
PoseTraj: Pose-Aware Trajectory Control in Video Diffusion
UCM-VeID V2: A Richer Dataset and A Pre-training Method for UAV Cross-Modality Vehicle Re-Identification
AnyEdit: Mastering Unified High-Quality Image Editing for Any Idea
Generative Multimodal Pretraining with Discrete Diffusion Timestep Tokens
VideoWorld: Exploring Knowledge Learning from Unlabeled Videos
StageDesigner: Artistic Stage Generation for Scenography via Theater Scripts
STAA-SNN: Spatial-Temporal Attention Aggregator for Spiking Neural Networks
Anomize: Better Open Vocabulary Video Anomaly Detection
HyperFree: A Channel-adaptive and Tuning-free Foundation Model for Hyperspectral Remote Sensing Imagery
Volume Tells: Dual Cycle-Consistent Diffusion for 3D Fluorescence Microscopy De-noising and Super-Resolution
DexGrasp Anything: Towards Universal Robotic Dexterous Grasping with Physics Awareness
MaskGaussian: Adaptive 3D Gaussian Representation from Probabilistic Masks
Unboxed: Geometrically and Temporally Consistent Video Outpainting
Less is More: Efficient Model Merging with Binary Task Switch
FreeGave: 3D Physics Learning from Dynamic Videos by Gaussian Velocity
PIDSR: Complementary Polarized Image Demosaicing and Super-Resolution
Code-as-Monitor: Constraint-aware Visual Programming for Reactive and Proactive Robotic Failure Detection
Image is All You Need to Empower Large-scale Diffusion Models for In-Domain Generation
Wav2Sem: Plug-and-Play Audio Semantic Decoupling for 3D Speech-Driven Facial Animation
Navigation World Models
LITA-GS: Illumination-Agnostic Novel View Synthesis via Reference-Free 3D Gaussian Splatting and Physical Priors
Re-HOLD: Video Hand Object Interaction Reenactment via adaptive Layout-instructed Diffusion Model
Hallo3: Highly Dynamic and Realistic Portrait Image Animation with Video Diffusion Transformer
TKG-DM: Training-free Chroma Key Content Generation Diffusion Model
Reconstructing In-the-Wild Open-Vocabulary Human-Object Interactions
Forensics Adapter: Adapting CLIP for Generalizable Face Forgery Detection
Consistency-aware Self-Training for Iterative-based Stereo Matching
Unlearning through Knowledge Overwriting: Reversible Federated Unlearning via Selective Sparse Adapter
MLVU: Benchmarking Multi-task Long Video Understanding
OmniGen: Unified Image Generation
NeRFPrior: Learning Neural Radiance Field as a Prior for Indoor Scene Reconstruction
Semi-Supervised State-Space Model with Dynamic Stacking Filter for Real-World Video Deraining
Audio-Visual Instance Segmentation
Improving the Transferability of Adversarial Attacks on Face Recognition with Diverse Parameters Augmentation
GraphMimic: Graph-to-Graphs Generative Modeling from Videos for Policy Learning
Florence-VL: Enhancing Vision-Language Models with Generative Vision Encoder and Depth-Breadth Fusion
DrivingSphere: Building a High-fidelity 4D World for Closed-loop Simulation
GBC-Splat: Generalizable Gaussian-Based Clothed Human Digitalization under Sparse RGB Cameras
Implicit Correspondence Learning for Image-to-Point Cloud Registration
Generative Map Priors for Collaborative BEV Semantic Segmentation
Revisiting Audio-Visual Segmentation with Vision-Centric Transformer
Visual Lexicon: Rich Image Features in Language Space
Decoupled Distillation to Erase: A General Unlearning Method for Any Class-centric Tasks
nnWNet: Rethinking the Use of Transformers in Biomedical Image Segmentation and Calling for a Unified Evaluation Benchmark
FlashSloth: Lightning Multimodal Large Language Models via Embedded Visual Compression
Mamba-Reg: Vision Mamba Also Needs Registers
Adventurer: Optimizing Vision Mamba Architecture Designs for Efficiency
Learning Compatible Multi-Prize Subnetworks for Asymmetric Retrieval
MambaVLT: Time-Evolving Multimodal State Space Model for Vision-Language Tracking
HarmonySet: A Comprehensive Dataset for Understanding Video-Music Semantic Alignment and Temporal Synchronization
Towards General Visual-Linguistic Face Forgery Detection
Decompositional Neural Scene Reconstruction with Generative Diffusion Prior
Free360: Layered Gaussian Splatting for Unbounded 360-Degree View Synthesis from Extremely Sparse and Unposed Views
ACAttack: Adaptive Cross Attacking RGB-T Tracker via Multi-Modal Response Decoupling
CL-LoRA: Continual Low-Rank Adaptation for Rehearsal-Free Class-Incremental Learning
MAR-3D: Progressive Masked Auto-regressor for High-Resolution 3D Generation
ProAPO: Progressively Automatic Prompt Optimization for Visual Classification
Missing Target-Relevant Information Prediction with World Model for Accurate Zero-Shot Composed Image Retrieval
Enhancing Diversity for Data-free Quantization
Pose-Guided Temporal Enhancement for Robust Low-Resolution Hand Reconstruction
CPath-Omni: A Unified Multimodal Foundation Model for Patch and Whole Slide Image Analysis in Computational Pathology
VasTSD: Learning 3D Vascular Tree-state Space Diffusion Model for Angiography Synthesis
OnlineAnySeg: Online Zero-Shot 3D Segmentation by Visual Foundation Model Guided 2D Mask Merging
Test-Time Backdoor Detection for Object Detection Models
Acquire and then Adapt: Squeezing out Text-to-Image Model for Image Restoration
COUNTS: Benchmarking Object Detectors and Multimodal Large Language Models under Distribution Shifts
beta-FFT: Nonlinear Interpolation and Differentiated Training Strategies for Semi-Supervised Medical Image Segmentation
Depth-Guided Bundle Sampling for Efficient Generalizable Neural Radiance Field Reconstruction
GoLF-NRT: Integrating Global Context and Local Geometry for Few-Shot View Synthesis
GuardSplat: Efficient and Robust Watermarking for 3D Gaussian Splatting
SynerGen-VL: Towards Synergistic Image Understanding and Generation with Vision Experts and Token Folding
Rethinking Lanes and Points in Complex Scenarios for Monocular 3D Lane Detection
Continual SFT Matches Multimodal RLHF with Negative Supervision
OODD: Test-time Out-of-Distribution Detection with Dynamic Dictionary
SegAgent: Exploring Pixel Understanding Capabilities in MLLMs by Imitating Human Annotator Trajectories
AdaDARE-gamma: Balancing Stability and Plasticity in Multi-modal LLMs through Efficient Adaptation
ASAP: Advancing Semantic Alignment Promotes Multi-Modal Manipulation Detecting and Grounding
Feature Spectrum Learning for Remote Sensing Change Detection
FisherTune: Fisher-Guided Robust Tuning of Vision Foundation Models for Domain Generalized Segmentation
Cropper: Vision-Language Model for Image Cropping through In-Context Learning
VideoEspresso: A Large-Scale Chain-of-Thought Dataset for Fine-Grained Video Reasoning via Core Frame Selection
Monocular and Generalizable Gaussian Talking Head Animation
Analyzing the Synthetic-to-Real Domain Gap in 3D Hand Pose Estimation
Enhanced Contrastive Learning with Multi-view Longitudinal Data for Chest X-ray Report Generation
Neural Video Compression with Context Modulation
DAMM-Diffusion: Learning Divergence-Aware Multi-Modal Diffusion Model for Nanoparticles Distribution Prediction
Rethinking End-to-End 2D to 3D Scene Segmentation in Gaussian Splatting
INFP: Audio-Driven Interactive Head Generation in Dyadic Conversations
AniMer: Animal Pose and Shape Estimation Using Family Aware Transformer
VideoGLaMM: A Large Multimodal Model for Pixel-Level Visual Grounding in Videos
DPFlow: Adaptive Optical Flow Estimation with a Dual-Pyramid Framework
Exact: Exploring Space-Time Perceptive Clues for Weakly Supervised Satellite Image Time Series Semantic Segmentation
Rethinking Query-based Transformer for Continual Image Segmentation
CAP-Net: A Unified Network for 6D Pose and Size Estimation of Categorical Articulated Parts from a Single RGB-D Image
HRAvatar: High-Quality and Relightable Gaussian Head Avatar
AffordDP: Generalizable Diffusion Policy with Transferable Affordance
MITracker: Multi-View Integration for Visual Object Tracking
Forming Auxiliary High-confident Instance-level Loss to Promote Learning from Label Proportions
Learning Class Prototypes for Unified Sparse-Supervised 3D Object Detection
FADE: Frequency-Aware Diffusion Model Factorization for Video Editing
EntityErasure: Erasing Entity Cleanly via Amodal Entity Segmentation and Completion
WildGS-SLAM: Monocular Gaussian Splatting SLAM in Dynamic Environments
EntropyMark: Towards More Harmless Backdoor Watermark via Entropy-based Constraint for Open-source Dataset Copyright Protection
Masked Point-Entity Contrast for Open-Vocabulary 3D Scene Understanding
VoxelSplat: Dynamic Gaussian Splatting as an Effective Loss for Occupancy and Flow Prediction
ROCKET-1: Mastering Open-World Interaction with Visual-Temporal Context Prompting
OPTICAL: Leveraging Optimal Transport for Contribution Allocation in Dataset Distillation
I2VGuard: Safeguarding Images against Misuse in Diffusion-based Image-to-Video Models
Make-It-Animatable: An Efficient Framework for Authoring Animation-Ready 3D Characters
DesignDiffusion: High-Quality Text-to-Design Image Generation with Diffusion Models
MotionPro: A Precise Motion Controller for Image-to-Video Generation
Making Old Film Great Again: Degradation-aware State Space Model for Old Film Restoration
VL-RewardBench: A Challenging Benchmark for Vision-Language Generative Reward Models
EchoTraffic: Enhancing Traffic Anomaly Understanding with Audio-Visual Insights
Mitigating Hallucinations in Large Vision-Language Models via DPO: On-Policy Data Hold the Key
HOIGen-1M: A Large-scale Dataset for Human-Object Interaction Video Generation
Incomplete Multi-modal Brain Tumor Segmentation via Learnable Sorting State Space Model
VISTREAM: Improving Computation Efficiency of Visual Streaming Perception via Law-of-Charge-Conservation Inspired Spiking Neural Network
SuperPC: A Single Diffusion Model for Point Cloud Completion, Upsampling, Denoising, and Colorization
Cross-Modal 3D Representation with Multi-View Images and Point Clouds
Which Viewpoint Shows it Best? Language for Weakly Supervising View Selection in Multi-view Instructional Videos
SpatialCLIP: Learning 3D-aware Image Representations from Spatially Discriminative Language
Radio Frequency Ray Tracing with Neural Object Representation for Enhanced RF Modeling
SoftVQ-VAE: Efficient 1-Dimensional Continuous Tokenizer
Collaborative Decoding Makes Visual Auto-Regressive Modeling Efficient
Tracktention: Leveraging Point Tracking to Attend Videos Faster and Better
FedSPA: Generalizable Federated Graph Learning under Homophily Heterogeneity
Video-MME: The First-Ever Comprehensive Evaluation Benchmark of Multi-modal LLMs in Video Analysis
Frequency-Biased Synergistic Design for Image Compression and Compensation
Learning to Normalize on the SPD Manifold under Bures-Wasserstein Geometry
LogoSP: Local-global Grouping of Superpoints for Unsupervised Semantic Segmentation of 3D Point Clouds
Learning Flow Fields in Attention for Controllable Person Image Generation
MPDrive: Improving Spatial Understanding with Marker-Based Prompt Learning for Autonomous Driving
Dual Diffusion for Unified Image Generation and Understanding
Detecting Backdoor Attacks in Federated Learning via Direction Alignment Inspection
ModeSeq: Taming Sparse Multimodal Motion Prediction with Sequential Mode Modeling
Iterative Predictor-Critic Code Decoding for Real-World Image Dehazing
Open Ad-hoc Categorization with Contextualized Feature Learning
DiG: Scalable and Efficient Diffusion Models with Gated Linear Attention
Video Depth Anything: Consistent Depth Estimation for Super-Long Videos
Schedule On the Fly: Diffusion Time Prediction for Faster and Better Image Generation
FG^2: Fine-Grained Cross-View Localization by Fine-Grained Feature Matching
CoMatcher: Multi-View Collaborative Feature Matching
Bridging Modalities: Improving Universal Multimodal Retrieval by Multimodal Large Language Models
RandAR: Decoder-only Autoregressive Visual Generation in Random Orders
GLUS: Global-Local Reasoning Unified into A Single Large Language Model for Video Segmentation
AIM-Fair: Advancing Algorithmic Fairness via Selectively Fine-Tuning Biased Models with Contextual Synthetic Data
FSBench: A Figure Skating Benchmark for Advancing Artistic Sports Understanding
EfficientLLaVA: Generalizable Auto-Pruning for Large Vision-language Models
Text-guided Sparse Voxel Pruning for Efficient 3D Visual Grounding
MoManipVLA: Transferring Vision-language-action Models for General Mobile Manipulation
UniGoal: Towards Universal Zero-shot Goal-oriented Navigation
DeformCL: Learning Deformable Centerline Representation for Vessel Extraction in 3D Medical Image
Task-driven Image Fusion with Learnable Fusion Loss
Synthetic Visual Genome
Coarse Correspondences Boost Spatial-Temporal Reasoning in Multimodal Language Model
Reconciling Stochastic and Deterministic Strategies for Zero-shot Image Restoration using Diffusion Model in Dual
ConceptGuard: Continual Personalized Text-to-Image Generation with Forgetting and Confusion Mitigation
Beyond Words: Augmenting Discriminative Richness via Diffusions in Unsupervised Prompt Learning
Training-free Dense-Aligned Diffusion Guidance for Modular Conditional Image Synthesis
Time of the Flight of the Gaussians: Optimizing Depth Indirectly in Dynamic Radiance Fields
MeGA: Hybrid Mesh-Gaussian Head Avatar for High-Fidelity Rendering and Head Editing
PrEditor3D: Fast and Precise 3D Shape Editing
VideoTree: Adaptive Tree-based Video Representation for LLM Reasoning on Long Videos
Vid2Sim: Realistic and Interactive Simulation from Video for Urban Navigation
ProtoDepth: Unsupervised Continual Depth Completion with Prototypes
R2C: Mapping Room to Chessboard to Unlock LLM As Low-Level Action Planner
Advancing Adversarial Robustness in GNeRFs: The IL2-NeRF Attack
Co-Speech Gesture Video Generation with Implicit Motion-Audio Entanglement
Seeing the Abstract: Translating the Abstract Language for Vision Language Models
VISTA3D: A Unified Segmentation Foundation Model For 3D Medical Imaging
VILA-M3: Enhancing Vision-Language Models with Medical Expert Knowledge
AvatarArtist: Open-Domain 4D Avatarization
Plug-and-Play Versatile Compressed Video Enhancement
Evaluating Vision-Language Models as Evaluators in Path Planning
DPSeg: Dual-Prompt Cost Volume Learning for Open-Vocabulary Semantic Segmentation
InsightEdit: Towards Better Instruction Following for Image Editing
Stable-SCore: A Stable Registration-based Framework for 3D Shape Correspondence
Janus: Decoupling Visual Encoding for Unified Multimodal Understanding and Generation
MINIMA: Modality Invariant Image Matching
SymDPO: Boosting In-Context Learning of Large Multimodal Models with Symbol Demonstration Direct Preference Optimization
AdaMMS: Model Merging for Heterogeneous Multimodal Large Language Models with Unsupervised Coefficient Optimization
MMVU: Measuring Expert-Level Multi-Discipline Video Understanding
DreamOmni: Unified Image Generation and Editing
SleeperMark: Towards Robust Watermark against Fine-Tuning Text-to-image Diffusion Models
Reward Fine-Tuning Two-Step Diffusion Models via Learning Differentiable Latent-Space Surrogate Reward
LUCAS: Layered Universal Codec Avatars
Inversion Circle Interpolation: Diffusion-based Image Augmentation for Data-scarce Classification
Towards Better Alignment: Training Diffusion Models with Reinforcement Learning Against Sparse Rewards
CoMM: A Coherent Interleaved Image-Text Dataset for Multimodal Understanding and Generation
Motions as Queries: One-Stage Multi-Person Holistic Human Motion Capture
AniGS: Animatable Gaussian Avatar from a Single Image with Inconsistent Gaussian Reconstruction
PAVE: Patching and Adapting Video Large Language Models
IterIS: Iterative Inference-Solving Alignment for LoRA Merging
LLM-driven Multimodal and Multi-Identity Listening Head Generation
STCOcc: Sparse Spatial-Temporal Cascade Renovation for 3D Occupancy and Scene Flow Prediction
FATE: Full-head Gaussian Avatar with Textural Editing from Monocular Video
LMO: Linear Mamba Operator for MRI Reconstruction
Learning Visual Generative Priors without Text
Exploring Sparse MoE in GANs for Text-conditioned Image Synthesis
OpenING: A Comprehensive Benchmark for Judging Open-ended Interleaved Image-Text Generation
Benchmarking Large Vision-Language Models via Directed Scene Graph for Comprehensive Image Captioning
Contextual AD Narration with Interleaved Multimodal Sequence
Mimir: Improving Video Diffusion Models for Precise Text Understanding
MotionStone: Decoupled Motion Intensity Modulation with Diffusion Transformer for Image-to-Video Generation
FRAMES-VQA: Benchmarking Fine-Tuning Robustness across Multi-Modal Shifts in Visual Question Answering
When Domain Generalization meets Generalized Category Discovery: An Adaptive Task-Arithmetic Driven Approach
From Multimodal LLMs to Generalist Embodied Agents: Methods and Lessons
OSV: One Step is Enough for High-Quality Image to Video Generation
Adapting Dense Matching for Homography Estimation with Grid-based Acceleration
MangaNinja: Line Art Colorization with Precise Reference Following
Show and Segment: Universal Medical Image Segmentation via In-Context Learning
MLLM-as-a-Judge for Image Safety without Human Labeling
Can Generative Video Models Help Pose Estimation?
JanusFlow: Harmonizing Autoregression and Rectified Flow for Unified Multimodal Understanding and Generation
Unleashing the Potential of Consistency Learning for Detecting and Grounding Multi-Modal Media Manipulation
MergeVQ: A Unified Framework for Visual Generation and Representation with Disentangled Token Merging and Quantization
DaCapo: Score Distillation as Stacked Bridge for Fast and High-quality 3D Editing
ArtiScene: Language-Driven Artistic 3D Scene Generation Through Image Intermediary
Model Diagnosis and Correction via Linguistic and Implicit Attribute Editing
ObjectMover: Generative Object Movement with Video Prior
Generative Image Layer Decomposition with Visual Effects
MagicQuill: An Intelligent Interactive Image Editing System
EMOVA: Empowering Language Models to See, Hear and Speak with Vivid Emotions
Improved Video VAE for Latent Video Diffusion Model
SkillMimic: Learning Basketball Interaction Skills from Demonstrations
Towards Continual Universal Segmentation
R-TPT: Improving Adversarial Robustness of Vision-Language Models through Test-Time Prompt Tuning
CASP: Consistency-aware Audio-induced Saliency Prediction Model for Omnidirectional Video
OmniMMI: A Comprehensive Multi-modal Interaction Benchmark in Streaming Video Contexts
Progressive Rendering Distillation: Adapting Stable Diffusion for Instant Text-to-Mesh Generation without 3D Data
Pixel-level and Semantic-level Adjustable Super-resolution: A Dual-LoRA Approach
MVBoost: Boost 3D Reconstruction with Multi-View Refinement
AnomalyNCD: Towards Novel Anomaly Class Discovery in Industrial Scenarios
SpiritSight Agent: Advanced GUI Agent with One Look
Revealing Key Details to See Differences: A Novel Prototypical Perspective for Skeleton-based Action Recognition
Accelerating Multimodal Large Language Models by Searching Optimal Vision Token Reduction
Unleashing In-context Learning of Autoregressive Models for Few-shot Image Manipulation
Apollo: An Exploration of Video Understanding in Large Multimodal Models
Building a Mind Palace: Structuring Environment-Grounded Semantic Graphs for Effective Long Video Analysis with LLMs
IRIS: Inverse Rendering of Indoor Scenes from Low Dynamic Range Images
Volumetric Surfaces: Representing Fuzzy Geometries with Layered Meshes
Your Scale Factors are My Weapon: Targeted Bit-Flip Attacks on Vision Transformers via Scale Factor Manipulation
ReCap: Better Gaussian Relighting with Cross-Environment Captures
Complexity Experts are Task-Discriminative Learners for Any Image Restoration
The Devil is in Temporal Token: High Quality Video Reasoning Segmentation
HiPART: Hierarchical Pose AutoRegressive Transformer for Occluded 3D Human Pose Estimation
Stabilizing and Accelerating Autofocus with Expert Trajectory Regularized Deep Reinforcement Learning
PromptHash: Affinity-Prompted Collaborative Cross-Modal Learning for Adaptive Hashing Retrieval
Improve Representation for Imbalanced Regression through Geometric Constraints
Adaptive Dropout: Unleashing Dropout across Layers for Generalizable Image Super-Resolution
Leveraging SD Map to Augment HD Map-based Trajectory Prediction
MIDI: Multi-Instance Diffusion for Single Image to 3D Scene Generation
EasyCraft: A Robust and Efficient Framework for Automatic Avatar Crafting
Hierarchical Knowledge Prompt Tuning for Multi-task Test-Time Adaptation
IM-Portrait: Learning 3D-aware Video Diffusion for Photorealistic Talking Heads from Monocular Videos
AirRoom: Objects Matter in Room Reidentification
World-consistent Video Diffusion with Explicit 3D Modeling
MODfinity: Unsupervised Domain Adaptation with Multimodal Information Flow Intertwining
PhyS-EdiT: Physics-aware Semantic Image Editing with Text Description
Unified Reconstruction of Static and Dynamic Scenes from Events
CaricatureBooth: Data-Free Interactive Caricature Generation in a Photo Booth
FFR: Frequency Feature Rectification for Weakly Supervised Semantic Segmentation
Adaptive Unimodal Regulation for Balanced Multimodal Information Acquisition
Explaining Domain Shifts in Language: Concept Erasing for Interpretable Image Classification
Discovering Fine-Grained Visual-Concept Relations by Disentangled Optimal Transport Concept Bottleneck Models
4D LangSplat: 4D Language Gaussian Splatting via Multimodal Large Language Models
TokenHSI: Unified Synthesis of Physical Human-Scene Interactions through Task Tokenization
UCOD-DPL: Unsupervised Camouflaged Object Detection via Dynamic Pseudo-label Learning
Evolving High-Quality Rendering and Reconstruction in a Unified Framework with Contribution-Adaptive Regularization
GaussianWorld: Gaussian World Model for Streaming 3D Occupancy Prediction
Comprehensive Relighting: Generalizable and Consistent Monocular Human Relighting and Harmonization
SynthLight: Portrait Relighting with Diffusion Model by Learning to Re-render Synthetic Faces
MegaSynth: Scaling Up 3D Scene Reconstruction with Synthesized Data
ART: Anonymous Region Transformer for Variable Multi-Layer Transparent Image Generation
Arbitrary-steps Image Super-resolution via Diffusion Inversion
MV-DUSt3R+: Single-Stage Scene Reconstruction from Sparse Views In 2 Seconds
D2SP: Dynamic Dual-Stage Purification Framework for Dual Noise Mitigation in Vision-based Affective Recognition
Less Attention is More: Prompt Transformer for Generalized Category Discovery
Seeing What Matters: Empowering CLIP with Patch Generation-to-Selection
Marten: Visual Question Answering with Mask Generation for Multi-modal Document Understanding
BACON: Improving Clarity of Image Captions via Bag-of-Concept Graphs
SynTab-LLaVA: Enhancing Multimodal Table Understanding with Decoupled Synthesis
BADGR: Bundle Adjustment Diffusion Conditioned by Gradients for Wide-Baseline Floor Plan Reconstruction
Federated Learning with Domain Shift Eraser
Scaling up Image Segmentation across Data and Tasks
PanDA: Towards Panoramic Depth Anything with Unlabeled Panoramas and Mobius Spatial Augmentation
Speedy-Splat: Fast 3D Gaussian Splatting with Sparse Pixels and Sparse Primitives
PUP 3D-GS: Principled Uncertainty Pruning for 3D Gaussian Splatting
Material Anything: Generating Materials for Any 3D Object via Diffusion
Generative Gaussian Splatting for Unbounded 3D City Generation
LiMoE: Mixture of LiDAR Representation Learners from Automotive Scenes
Disco4D: Disentangled 4D Human Generation and Animation from a Single Image
HMAR: Efficient Hierarchical Masked Auto-Regressive Image Generation
Insight-V: Exploring Long-Chain Visual Reasoning with Multimodal Large Language Models
WildAvatar: Learning In-the-wild 3D Avatars from the Web
S2D-LFE: Sparse-to-Dense Light Field Event Generation
SOLAMI: Social Vision-Language-Action Modeling for Immersive Interaction with 3D Autonomous Characters
AudCast: Audio-Driven Human Video Generation by Cascaded Diffusion Transformers
EgoLife: Towards Egocentric Life Assistant
Learning Bijective Surface Parameterization for Inferring Signed Distance Functions from Sparse Point Clouds with Grid Deformation
A Unified Model for Compressed Sensing MRI Across Undersampling Patterns
MAGE: Single Image to Material-Aware 3D via the Multi-View G-Buffer Estimation Model
LOCORE: Image Re-ranking with Long-Context Sequence Modeling
Homogeneous Dynamics Space for Heterogeneous Humans
Linear Attention Modeling for Learned Image Compression
Decoupled Motion Expression Video Segmentation
Neural LightRig: Unlocking Accurate Object Normal and Material Estimation with Multi-Light Diffusion
SF3D: Stable Fast 3D Mesh Reconstruction with UV-unwrapping and Illumination Disentanglement
Symmetry Strikes Back: From Single-Image Symmetry Detection to 3D Generation
Detect Any Mirrors: Boosting Learning Reliability on Large-Scale Unlabeled Data with an Iterative Data Engine
SnowMaster: Comprehensive Real-world Image Desnowing via MLLM with Multi-Model Feedback Optimization
IM-Zero: Instance-level Motion Controllable Video Generation in a Zero-shot Manner
Watermarking One for All: A Robust Watermarking Scheme Against Partial Image Theft
Beyond Generation: A Diffusion-based Low-level Feature Extractor for Detecting AI-generated Images
Incremental Object Keypoint Learning
DriveGPT4-V2: Harnessing Large Language Model Capabilities for Enhanced Closed-Loop Autonomous Driving
Completion as Enhancement: A Degradation-Aware Selective Image Guided Network for Depth Completion
BlockDance: Reuse Structurally Similar Spatio-Temporal Features to Accelerate Diffusion Transformers
EDEN: Enhanced Diffusion for High-quality Large-motion Video Frame Interpolation
StableAnimator: High-Quality Identity-Preserving Human Image Animation
3D Prior Is All You Need: Cross-Task Few-shot 2D Gaze Estimation
STOP: Integrated Spatial-Temporal Dynamic Prompting for Video Understanding
Long Video Diffusion Generation with Segmented Cross-Attention and Content-Rich Video Data Curation
SeriesBench: A Benchmark for Narrative-Driven Drama Series Understanding
From Poses to Identity: Training-Free Person Re-Identification via Feature Centralization
CityWalker: Learning Embodied Urban Navigation from Web-Scale Videos
SPC-GS: Gaussian Splatting with Semantic-Prompt Consistency for Indoor Open-World Free-view Synthesis from Sparse Inputs
Seeing Far and Clearly: Mitigating Hallucinations in MLLMs with Attention Causal Decoding
Tightening Robustness Verification of MaxPool-based Neural Networks via Minimizing the Over-Approximation Zone
JTD-UAV: MLLM-Enhanced Joint Tracking and Description Framework for Anti-UAV Systems
PMA: Towards Parameter-Efficient Point Cloud Understanding via Point Mamba Adapter
Adapting Pre-trained 3D Models for Point Cloud Video Understanding via Cross-frame Spatio-temporal Perception
Embracing Collaboration Over Competition: Condensing Multiple Prompts for Visual In-Context Learning
MambaIRv2: Attentive State Space Restoration
SemiDAViL: Semi-supervised Domain Adaptation with Vision-Language Guidance for Semantic Segmentation
Alignment, Mining and Fusion: Representation Alignment with Hard Negative Mining and Selective Knowledge Fusion for Medical Visual Question Answering
LamRA: Large Multimodal Model as Your Advanced Retrieval Assistant
DCEvo: Discriminative Cross-Dimensional Evolutionary Learning for Infrared and Visible Image Fusion
DifIISR: A Diffusion Model with Gradient Guidance for Infrared Image Super-Resolution
Human Motion Instruction Tuning
RealEdit: Reddit Edits As a Large-scale Empirical Dataset for Image Transformations
Stealthy Backdoor Attack in Self-Supervised Learning Vision Encoders for Large Vision Language Models
Continuous Adverse Weather Removal via Degradation-Aware Distillation
VidMuse: A Simple Video-to-Music Generation Framework with Long-Short-Term Modeling
MCCD: Multi-Agent Collaboration-based Compositional Diffusion for Complex Text-to-Image Generation
SPA-VL: A Comprehensive Safety Preference Alignment Dataset for Vision Language Models
Dual Semantic Guidance for Open Vocabulary Semantic Segmentation
Instant Gaussian Stream: Fast and Generalizable Streaming of Dynamic Scene Reconstruction via Gaussian Splatting
Reproducible Vision-Language Models Meet Concepts Out of Pre-Training
MMAR: Towards Lossless Multi-Modal Auto-Regressive Probabilistic Modeling
Towards Smart Point-and-Shoot Photography
Number it: Temporal Grounding Videos like Flipping Manga
NVILA: Efficient Frontier Visual Language Models
CoT-VLA: Visual Chain-of-Thought Reasoning for Vision-Language-Action Models
Improving Adversarial Transferability on Vision Transformers via Forward Propagation Refinement
CLIP is Strong Enough to Fight Back: Test-time Counterattacks towards Zero-shot Adversarial Robustness of CLIP
Nullu: Mitigating Object Hallucinations in Large Vision-Language Models via HalluSpace Projection
SnapGen-V: Generating a Five-Second Video within Five Seconds on a Mobile Device
Q-Bench-Video: Benchmark the Video Quality Understanding of LMMs
Q-Eval-100K: Evaluating Visual Quality and Alignment Level for Text-to-Vision Content
Learning Hazing to Dehazing: Towards Realistic Haze Generation for Real-World Image Dehazing
Cross-modal Information Flow in Multimodal Large Language Models
Image Quality Assessment: From Human to Machine Preference
Continuous Space-Time Video Resampling with Invertible Motion Steganography
Fitted Neural Lossless Image Compression
HomoGen: Enhanced Video Inpainting via Homography Propagation and Diffusion
Boost the Inference with Co-training: A Depth-guided Mutual Learning Framework for Semi-supervised Medical Polyp Segmentation
Improving the Training of Data-Efficient GANs via Quality Aware Dynamic Discriminator Rejection Sampling
Generative Video Propagation
TransPixeler: Advancing Text-to-Video Generation with Transparency
FINECAPTION: Compositional Image Captioning Focusing on Wherever You Want at Any Granularity
TurboFill: Adapting Few-step Text-to-image Model for Fast Image Inpainting
Taming Video Diffusion Prior with Scene-Grounding Guidance for 3D Gaussian Splatting from Sparse Inputs
SURGEON: Memory-Adaptive Fully Test-Time Adaptation via Dynamic Activation Sparsity
The Scene Language: Representing Scenes with Programs, Words, and Embeddings
MVPaint: Synchronized Multi-View Diffusion for Painting Anything 3D
Improving Autoregressive Visual Generation with Cluster-Oriented Token Prediction
PartRM: Modeling Part-Level Dynamics with Large Cross-State Reconstruction Model
Balanced Direction from Multifarious Choices: Arithmetic Meta-Learning for Domain Generalization
MonSter: Marry Monodepth to Stereo Unleashes Power
Nonisotropic Gaussian Diffusion for Realistic 3D Human Motion Prediction
One Model for ALL: Low-Level Task Interaction Is a Key to Task-Agnostic Image Fusion
One-for-More: Continual Diffusion Model for Anomaly Detection
SemiETS: Integrating Spatial and Content Consistencies for Semi-Supervised End-to-end Text Spotting
Data Synthesis with Diverse Styles for Face Recognition via 3DMM-Guided Diffusion
Accurate Differential Operators for Hybrid Neural Fields
Task Preference Optimization: Improving Multimodal Large Language Models with Vision Task Alignment
Mamba-Adaptor: State Space Model Adaptor for Visual Recognition
Visual and Semantic Prompt Collaboration for Generalized Zero-Shot Learning
Dense Match Summarization for Faster Two-view Estimation
FSHNet: Fully Sparse Hybrid Network for 3D Object Detection
High-quality Point Cloud Oriented Normal Estimation via Hybrid Angular and Euclidean Distance Encoding
MovieBench: A Hierarchical Movie Level Dataset for Long Video Generation
Flexible Group Count Enables Hassle-Free Structured Pruning
Exploring Contextual Attribute Density in Referring Expression Counting
Crab: A Unified Audio-Visual Scene Understanding Model with Explicit Cooperation
SkySense-O: Towards Open-World Remote Sensing Interpretation with Vision-Centric Visual-Language Modeling
DEAL: Data-Efficient Adversarial Learning for High-Quality Infrared Imaging
SocialGesture: Delving into Multi-person Gesture Understanding
GET: Unlocking the Multi-modal Potential of CLIP for Generalized Category Discovery
Training Data Provenance Verification: Did Your Model Use Synthetic Data from My Generative Model for Training?
2DMamba: Efficient State Space Model for Image Representation with Applications on Giga-Pixel Whole Slide Image Classification
BlueLM-V-3B: Algorithm and System Co-Design for Multimodal Large Language Models on Mobile Devices
Enhancing Vision-Language Compositional Understanding with Multimodal Synthetic Data
Dynamic Camera Poses and Where to Find Them
Identity-Clothing Similarity Modeling for Unsupervised Clothing Change Person Re-Identification
The Power of Context: How Multimodality Improves Image Super-Resolution
Optimal Transport-Guided Source-Free Adaptation for Face Anti-Spoofing
FedCS: Coreset Selection for Federated Learning
OW-OVD: Unified Open World and Open Vocabulary Object Detection
RANGE: Retrieval Augmented Neural Fields for Multi-Resolution Geo-Embeddings
CRISP: Object Pose and Shape Estimation with Test-Time Adaptation
Revisiting Generative Replay for Class Incremental Object Detection
BadToken: Token-level Backdoor Attacks to Multi-modal Large Language Models
HandOS: 3D Hand Reconstruction in One Stage
PatchDEMUX: A Certifiably Robust Framework for Multi-label Classifiers Against Adversarial Patches
StarGen: A Spatiotemporal Autoregression Framework with Video Diffusion Model for Scalable and Controllable Scene Generation
Self-Supervised Cross-View Correspondence with Predictive Cycle Consistency
Cross-modal Causal Relation Alignment for Video Question Grounding
DSPNet: Dual-vision Scene Perception for Robust 3D Question Answering
WISH: Weakly Supervised Instance Segmentation using Heterogeneous Labels
ByTheWay: Boost Your Text-to-Video Generation Model to Higher Quality in a Training-free Way
DiffPortrait360: Consistent Portrait Diffusion for 360 View Synthesis
All-directional Disparity Estimation for Real-world QPD Images
Universal Scene Graph Generation
Interleaved-Modal Chain-of-Thought
CoMBO: Conflict Mitigation via Branched Optimization for Class Incremental Segmentation
Hybrid-Level Instruction Injection for Video Token Compression in Multi-modal Large Language Models
TAET: Two-Stage Adversarial Equalization Training on Long-Tailed Distributions
Removing Reflections from RAW Photos
Open Set Label Shift with Test Time Out-of-Distribution Reference
MedUnifier: Unifying Vision-and-Language Pre-training on Medical Data with Vision Generation Task using Discrete Visual Representations
CustomKD: Customizing Large Vision Foundation for Edge Model Improvement via Knowledge Distillation
Quad-Pixel Image Defocus Deblurring: A New Benchmark and Model
Augmented Deep Contexts for Spatially Embedded Video Coding
SCSegamba: Lightweight Structure-Aware Vision Mamba for Crack Segmentation in Structures
A Semantic Knowledge Complementarity based Decoupling Framework for Semi-supervised Class-imbalanced Medical Image Segmentation
Towards Practical Real-Time Neural Video Compression
LeanGaussian: Breaking Pixel or Point Cloud Correspondence in Modeling 3D Gaussians
On Denoising Walking Videos for Gait Recognition
UniHOPE: A Unified Approach for Hand-Only and Hand-Object Pose Estimation
Degradation-Aware Feature Perturbation for All-in-One Image Restoration
HistoFS: Non-IID Histopathologic Whole Slide Image Classification via Federated Style Transfer with RoI-Preserving
CocoER: Aligning Multi-Level Feature by Competition and Coordination for Emotion Recognition
SSHNet: Unsupervised Cross-modal Homography Estimation via Problem Reformulation and Split Optimization
VIRES: Video Instance Repainting via Sketch and Text Guided Generation
Action Detail Matters: Refining Video Recognition with Local Action Queries
Distraction is All You Need for Multimodal Large Language Model Jailbreaking
AR-Diffusion: Asynchronous Video Generation with Auto-Regressive Diffusion
FilmComposer: LLM-Driven Music Production for Silent Film Clips
LoRA Recycle: Unlocking Tuning-Free Few-Shot Adaptability in Visual Foundation Models by Recycling Pre-Tuned LoRAs
DiffusionDrive: Truncated Diffusion Model for End-to-End Autonomous Driving
Discovering Hidden Visual Concepts Beyond Linguistic Input in Infant Learning
SeedVR: Seeding Infinity in Diffusion Transformer Towards Generic Video Restoration
SPAR3D: Stable Point-Aware Reconstruction of 3D Objects from Single Images
DEIM: DETR with Improved Matching for Fast Convergence
Physical Plausibility-aware Trajectory Prediction via Locomotion Embodiment
Recover and Match: Open-Vocabulary Multi-Label Recognition through Knowledge-Constrained Optimal Transport
All-Day Multi-Camera Multi-Target Tracking
Coherent 3D Portrait Video Reconstruction via Triplane Fusion
GarmentPile: Point-Level Visual Affordance Guided Retrieval and Adaptation for Cluttered Garments Manipulation
Uncertainty-guided Perturbation for Image Super-Resolution Diffusion Model
SAR3D: Autoregressive 3D Object Generation and Understanding via Multi-scale 3D VQVAE
ANNEXE: Unified Analyzing, Answering, and Pixel Grounding for Egocentric Interaction
Exploring Visual Vulnerabilities via Multi-Loss Adversarial Search for Jailbreaking Vision-Language Models
Mitigating the Human-Robot Domain Discrepancy in Visual Pre-training for Robotic Manipulation
3D Dental Model Segmentation with Geometrical Boundary Preserving
Harnessing Frozen Unimodal Encoders for Flexible Multimodal Alignment
Anchor-Aware Similarity Cohesion in Target Frames Enables Predicting Temporal Moment Boundaries in 2D
Structured 3D Latents for Scalable and Versatile 3D Generation
GaussianIP: Identity-Preserving Realistic 3D Human Generation via Human-Centric Diffusion Prior
Tokenize Image Patches: Global Context Fusion for Effective Haze Removal in Large Images
Pippo: High-Resolution Multi-View Humans from a Single Image
Vid2Avatar-Pro: Authentic Avatar from Videos in the Wild via Universal Prior
UPME: An Unsupervised Peer Review Framework for Multimodal Large Language Model Evaluation
EgoLM: Multi-Modal Language Model of Egocentric Motions
Distilling Long-tailed Datasets
DPC: Dual-Prompt Collaboration for Tuning Vision-Language Models
A Selective Re-learning Mechanism for Hyperspectral Fusion Imaging
FlashGS: Efficient 3D Gaussian Splatting for Large-scale and High-resolution Rendering
Comprehensive Information Bottleneck for Unveiling Universal Attribution to Interpret Vision Transformers
Interactive Medical Image Segmentation: A Benchmark Dataset and Baseline
Motion Modes: What Could Happen Next?
ASIGN: An Anatomy-aware Spatial Imputation Graphic Network for 3D Spatial Transcriptomics
DiGIT: Multi-Dilated Gated Encoder and Central-Adjacent Region Integrated Decoder for Temporal Action Detection Transformer
Real-time High-fidelity Gaussian Human Avatars with Position-based Interpolation of Spatially Distributed MLPs
Instruct-CLIP: Improving Instruction-Guided Image Editing with Automated Data Refinement Using Contrastive Learning
Rethinking Reconstruction and Denoising in the Dark: New Perspective, General Architecture and Beyond
TaoAvatar: Real-Time Lifelike Full-Body Talking Avatars for Augmented Reality via 3D Gaussian Splatting
Feature Selection for Latent Factor Models
Mitigating Object Hallucinations in Large Vision-Language Models with Assembly of Global and Local Attention
Probability Density Geodesics in Image Diffusion Latent Space
Dynamic Neural Surfaces for Elastic 4D Shape Representation and Analysis
Quantization without Tears
Bridging Past and Future: End-to-End Autonomous Driving with Historical Prediction and Planning
Adaptive Markup Language Generation for Contextually-Grounded Visual Document Understanding
Bayesian Test-Time Adaptation for Vision-Language Models
MagicArticulate: Make Your 3D Models Articulation-Ready
End-to-End HOI Reconstruction Transformer with Graph-based Encoding
Mosaic of Modalities: A Comprehensive Benchmark for Multimodal Graph Learning
Argus: A Compact and Versatile Foundation Model for Vision
ShapeShifter: 3D Variations Using Multiscale and Sparse Point-Voxel Diffusion
Stretching Each Dollar: Diffusion Training from Scratch on a Micro-Budget
AdaCM^2: On Understanding Extremely Long-Term Video with Adaptive Cross-Modality Memory Reduction
GlyphMastero: A Glyph Encoder for High-Fidelity Scene Text Editing
SimLingo: Vision-Only Closed-Loop Autonomous Driving with Language-Action Alignment
Easy-editable Image Vectorization with Multi-layer Multi-scale Distributed Visual Feature Embedding
DivPrune: Diversity-based Visual Token Pruning for Large Multimodal Models
HiLoTs: High-Low Temporal Sensitive Representation Learning for Semi-Supervised LiDAR Segmentation in Autonomous Driving
IRGS: Inter-Reflective Gaussian Splatting with 2D Gaussian Ray Tracing
ROLL: Robust Noisy Pseudo-label Learning for Multi-View Clustering with Noisy Correspondence
TensoFlow: Tensorial Flow-based Sampler for Inverse Rendering
Fuzzy Multimodal Learning for Trusted Cross-modal Retrieval
Breaking the Low-Rank Dilemma of Linear Attention
RAD: Region-Aware Diffusion Models for Image Inpainting
On the Zero-shot Adversarial Robustness of Vision-Language Models: A Truly Zero-shot and Training-free Approach
Critic-V: VLM Critics Help Catch VLM Errors in Multimodal Reasoning
Anatomical Consistency and Adaptive Prior-informed Transformation for Multi-contrast MR Image Synthesis via Diffusion Model
Mind the Time: Temporally-Controlled Multi-Event Video Generation
Distilling Spectral Graph for Object-Context Aware Open-Vocabulary Semantic Segmentation
Gazing at Rewards: Eye Movements as a Lens into Human and AI Decision-Making in Hybrid Visual Foraging
VidSeg: Training-free Video Semantic Segmentation based on Diffusion Models
Galaxy Walker: Geometry-aware VLMs For Galaxy-scale Understanding
Gaussian Eigen Models for Human Heads
Can Machines Understand Composition? Dataset and Benchmark for Photographic Image Composition Embedding and Understanding
Synthetic Prior for Few-Shot Drivable Head Avatar Inversion
Let's Chorus: Partner-aware Hybrid Song-Driven 3D Head Animation
Diffusion Bridge: Leveraging Diffusion Model to Reduce the Modality Gap Between Text and Vision for Zero-Shot Image Captioning
LaTexBlend: Scaling Multi-concept Customized Generation with Latent Textual Blending
STiL: Semi-supervised Tabular-Image Learning for Comprehensive Task-Relevant Information Exploration in Multimodal Classification
Stacking Brick by Brick: Aligned Feature Isolation for Incremental Face Forgery Detection
AFL: A Single-Round Analytic Approach for Federated Learning with Pre-trained Models
Project-Probe-Aggregate: Efficient Fine-Tuning for Group Robustness
Link to the Past: Temporal Propagation for Fast 3D Human Reconstruction from Monocular Video
Decision SpikeFormer: Spike-Driven Transformer for Decision Making
PoseBH: Prototypical Multi-Dataset Training Beyond Human Pose Estimation
Steady Progress Beats Stagnation: Mutual Aid of Foundation and Conventional Models in Mixed Domain Semi-Supervised Medical Image Segmentation
Prosody-Enhanced Acoustic Pre-training and Acoustic-Disentangled Prosody Adapting for Movie Dubbing
Disentangled Pose and Appearance Guidance for Multi-Pose Generation
ImViD: Immersive Volumetric Videos for Enhanced VR Engagement
Towards Universal AI-Generated Image Detection by Variational Information Bottleneck Network
Devils in Middle Layers of Large Vision-Language Models: Interpreting, Detecting and Mitigating Object Hallucinations via Attention Lens
SimMotionEdit: Text-Based Human Motion Editing with Motion Similarity Prediction
PanSplat: 4K Panorama Synthesis with Feed-Forward Gaussian Splatting
4DGC: Rate-Aware 4D Gaussian Compression for Efficient Streamable Free-Viewpoint Video
Point Clouds Meets Physics: Dynamic Acoustic Field Fitting Network for Point Cloud Understanding
Revisiting MAE Pre-training for 3D Medical Image Segmentation
FastVLM: Efficient Vision Encoding for Vision Language Models
Differentiable Inverse Rendering with Interpretable Basis BRDFs
Multi-modal Vision Pre-training for Medical Image Analysis
Rethinking Token Reduction with Parameter-Efficient Fine-Tuning in ViT for Pixel-Level Tasks
MVPortrait: Text-Guided Motion and Emotion Control for Multi-view Vivid Portrait Animation
Finer-CAM: Spotting the Difference Reveals Finer Details for Visual Explanation
SGC-Net: Stratified Granular Comparison Network for Open-Vocabulary HOI Detection
LesionLocator: Zero-Shot Universal Tumor Segmentation and Tracking in 3D Whole-Body Imaging
Test-Time Domain Generalization via Universe Learning: A Multi-Graph Matching Approach for Medical Image Segmentation
CXPMRG-Bench: Pre-training and Benchmarking for X-ray Medical Report Generation on CheXpert Plus Dataset
Hyperbolic Category Discovery
LightLoc: Learning Outdoor LiDAR Localization at Light Speed
Object Detection using Event Camera: A MoE Heat Conduction based Detector and A New Benchmark Dataset
RAEncoder: A Label-Free Reversible Adversarial Examples Encoder for Dataset Intellectual Property Protection
SCFlow2: Plug-and-Play Object Pose Refiner with Shape-Constraint Scene Flow
A Flag Decomposition for Hierarchical Datasets
IDEA: Inverted Text with Cooperative Deformable Aggregation for Multi-modal Object Re-Identification
CoSpace: Benchmarking Continuous Space Perception Ability for Vision-Language Models
Hierarchical Features Matter: A Deep Exploration of Progressive Parameterization Method for Dataset Distillation
Auto Cherry-Picker: Learning from High-quality Generative Data Driven by Language
Learning on Model Weights using Tree Experts
VSNet: Focusing on the Linguistic Characteristics of Sign Language
Test-Time Fine-Tuning of Image Compression Models for Multi-Task Adaptability
Prof. Robot: Differentiable Robot Rendering Without Static and Self-Collisions
Any6D: Model-free 6D Pose Estimation of Novel Object
Wavelet and Prototype Augmented Query-based Transformer for Pixel-level Surface Defect Detection
Wonderland: Navigating 3D Scenes from a Single Image
ResCLIP: Residual Attention for Training-free Dense Vision-language Inference
DV-Matcher: Deformation-based Non-rigid Point Cloud Matching Guided by Pre-trained Visual Features
Patient-Level Anatomy Meets Scanning-Level Physics: Personalized Federated Low-Dose CT Denoising Empowered by Large Language Model
Recognition-Synergistic Scene Text Editing
Brain-Inspired Spiking Neural Networks for Energy-Efficient Object Detection
Language-Assisted Debiasing and Smoothing for Foundation Model-Based Semi-Supervised Learning
Object-aware Sound Source Localization via Audio-Visual Scene Understanding
Toward Real-world BEV Perception: Depth Uncertainty Estimation via Gaussian Splatting
KVQ: Boosting Video Quality Assessment via Saliency-guided Local Perception
PhD: A ChatGPT-Prompted Visual Hallucination Evaluation Dataset
ReNeg: Learning Negative Embedding with Reward Guidance
CL-MoE: Enhancing Multimodal Large Language Model with Dual Momentum Mixture-of-Experts for Continual Visual Question Answering
Docopilot: Improving Multimodal Models for Document-Level Understanding
DVHGNN: Multi-Scale Dilated Vision HGNN for Efficient Vision Recognition
Octopus: Alleviating Hallucination via Dynamic Contrastive Decoding
Active Event-based Stereo Vision
Parametric Point Cloud Completion for Polygonal Surface Reconstruction
FIRE: Robust Detection of Diffusion-Generated Images via Frequency-Guided Reconstruction Error
Reference-Based 3D-Aware Image Editing with Triplanes
Decoupling Training-Free Guided Diffusion by ADMM
Latent Drifting in Diffusion Models for Counterfactual Medical Image Synthesis
Mamba4D: Efficient 4D Point Cloud Video Understanding with Disentangled Spatial-Temporal State Space Models
Silence is Golden: Leveraging Adversarial Examples to Nullify Audio Control in LDM-based Talking-Head Generation
Event Fields: Capturing Light Fields at High Speed, Resolution, and Dynamic Range
AnyDressing: Customizable Multi-Garment Virtual Dressing via Latent Diffusion Models
Learning Audio-guided Video Representation with Gated Attention for Video-Text Retrieval
OmniStereo: Real-time Omnidirectional Depth Estimation with Multiview Fisheye Cameras
MoVE-KD: Knowledge Distillation for VLMs with Mixture of Visual Encoders
Decouple Distortion from Perception: Region Adaptive Diffusion for Extreme-low Bitrate Perception Image Compression
Object-Centric Prompt-Driven Vision-Language-Action Model for Robotic Manipulation
Let Samples Speak: Mitigating Spurious Correlation by Exploiting the Clusterness of Samples
Lift3D Policy: Lifting 2D Foundation Models for Robust 3D Robotic Manipulation
Gaussian Splatting for Efficient Satellite Image Photogrammetry
GENIUS: A Generative Framework for Universal Multimodal Search
Subspace Constraint and Contribution Estimation for Heterogeneous Federated Learning
GaussianSpa: An “Optimizing-Sparsifying” Simplification Framework for Compact and High-Quality 3D Gaussian Splatting
MetricGrids: Arbitrary Nonlinear Approximation with Elementary Metric Grids based Implicit Neural Representation
Correlative and Discriminative Label Grouping for Multi-Label Visual Prompt Tuning
All Languages Matter: Evaluating LMMs on Culturally Diverse 100 Languages
Question-Aware Gaussian Experts for Audio-Visual Question Answering
GASP: Gaussian Avatars with Synthetic Priors
SerialGen: Personalized Image Generation by First Standardization Then Personalization
Generative Inbetweening through Frame-wise Conditions-Driven Video Generation
GroupMamba: Efficient Group-Based Visual State Space Model
Hybrid Concept Bottleneck Models
Reconstructing Humans with a Biomechanically Accurate Skeleton
VideoScene: Distilling Video Diffusion Model to Generate 3D Scenes in One Step
Stochastic Human Motion Prediction with Memory of Action Transition and Action Characteristic
Improving Transferable Targeted Attacks with Feature Tuning Mixup
vesselFM: A Foundation Model for Universal 3D Blood Vessel Segmentation
SocialMOIF: Multi-Order Intention Fusion for Pedestrian Trajectory Prediction
Mask^2DiT: Dual Mask-based Diffusion Transformer for Multi-Scene Long Video Generation
Subnet-Aware Dynamic Supernet Training for Neural Architecture Search
A Unified Approach to Interpreting Self-supervised Pre-training Methods for 3D Point Clouds via Interactions
GeoDepth: From Point-to-Depth to Plane-to-Depth Modeling for Self-Supervised Monocular Depth Estimation
DreamTrack: Dreaming the Future for Multimodal Visual Object Tracking
StreamingT2V: Consistent, Dynamic, and Extendable Long Video Generation from Text
Don't Shake the Wheel: Momentum-Aware Planning in End-to-End Autonomous Driving
A New Statistical Model of Star Speckles for Learning to Detect and Characterize Exoplanets in Direct Imaging Observations
Efficient Transfer Learning for Video-language Foundation Models
A Focused Human Body Model for Accurate Anthropometric Measurements Extraction
AnimateAnything: Consistent and Controllable Animation for Video Generation
EvEnhancer: Empowering Effectiveness, Efficiency and Generalizability for Continuous Space-Time Video Super-Resolution with Events
Document Haystacks: Vision-Language Reasoning Over Piles of 1000+ Documents
EventSplat: 3D Gaussian Splatting from Moving Event Cameras for Real-time Rendering
Directional Label Diffusion Model for Learning from Noisy Labels
BWFormer: Building Wireframe Reconstruction from Airborne LiDAR Point Cloud with Transformer
Stereo Anywhere: Robust Zero-Shot Deep Stereo Matching Even Where Either Stereo or Mono Fail
3DTopia-XL: Scaling High-quality 3D Asset Generation via Primitive Diffusion
Identifying and Mitigating Position Bias of Multi-image Vision-Language Models
Rectified Diffusion Guidance for Conditional Generation
Discrete to Continuous: Generating Smooth Transition Poses from Sign Language Observations
Shape Abstraction via Marching Differentiable Support Functions
PolarNeXt: Rethink Instance Segmentation with Polar Representation
DyCON: Dynamic Uncertainty-aware Consistency and Contrastive Learning for Semi-supervised Medical Image Segmentation
LSNet: See Large, Focus Small
EasyHOI: Unleashing the Power of Large Models for Reconstructing Hand-Object Interactions in the Wild
SaMam: Style-aware State Space Model for Arbitrary Image Style Transfer
FLARE: Feed-forward Geometry, Appearance and Camera Estimation from Uncalibrated Sparse Views
You See it, You Got it: Learning 3D Creation on Pose-Free Videos at Scale
RCP-Bench: Benchmarking Robustness for Collaborative Perception Under Diverse Corruptions
AeSPa: Attention-guided Self-supervised Parallel Imaging for MRI Reconstruction
Self-Learning Hyperspectral and Multispectral Image Fusion via Adaptive Residual Guided Subspace Diffusion Model
SALAD: Skeleton-aware Latent Diffusion for Text-driven Motion Generation and Editing
PersonaBooth: Personalized Text-to-Motion Generation
T2ISafety: Benchmark for Assessing Fairness, Toxicity, and Privacy in Image Generation
DTOS: Dynamic Time Object Sensing with Large Multimodal Model
ACE: Anti-Editing Concept Erasure in Text-to-Image Models
Reasoning in Visual Navigation of End-to-end Trained Agents: A Dynamical Systems Approach
Mosaic3D: Foundation Dataset and Model for Open-Vocabulary 3D Segmentation
Real-IAD D³: A Real-World 2D/Pseudo-3D/3D Dataset for Industrial Anomaly Detection
WeGen: A Unified Model for Interactive Multimodal Generation as We Chat
PO3AD: Predicting Point Offsets toward Better 3D Point Cloud Anomaly Detection
SketchFusion: Learning Universal Sketch Features through Fusing Foundation Models
Ferret: An Efficient Online Continual Learning Framework under Varying Memory Constraints
Redefining <Creative> in Dictionary: Towards an Enhanced Semantic Understanding of Creative Generation
Generalized Recorrupted-to-Recorrupted: Self-Supervised Learning Beyond Gaussian Noise
Attend to Not Attended: Structure-then-Detail Token Merging for Post-training DiT Acceleration
Channel-wise Noise Scheduled Diffusion for Inverse Rendering in Indoor Scenes
ShiftwiseConv: Small Convolutional Kernel with Large Kernel Effect
Higher-Order Ratio Cycles for Fast and Globally Optimal Shape Matching
Mono2Stereo: A Benchmark and Empirical Study for Stereo Conversion
FDS: Frequency-Aware Denoising Score for Text-Guided Latent Diffusion Image Editing
Ground-V: Teaching VLMs to Ground Complex Instructions in Pixels
WAVE: Weight Templates for Adaptive Initialization of Variable-sized Models
Progressive Correspondence Regenerator for Robust 3D Registration
AesthetiQ: Enhancing Graphic Layout Design via Aesthetic-Aware Preference Alignment of Multi-modal Large Language Models
MoEdit: On Learning Quantity Perception for Multi-object Image Editing
STINR: Deciphering Spatial Transcriptomics via Implicit Neural Representation
Around the World in 80 Timesteps: A Generative Approach to Global Visual Geolocation
Matrix3D: Large Photogrammetry Model All-in-One
Tora: Trajectory-oriented Diffusion Transformer for Video Generation
RigGS: Rigging of 3D Gaussians for Modeling Articulated Objects in Videos
CholecTrack20: A Multi-Perspective Tracking Dataset for Surgical Tools
Beyond Single-Modal Boundary: Cross-Modal Anomaly Detection through Visual Prototype and Harmonization
Latent Space Super-Resolution for Higher-Resolution Image Generation with Diffusion Models
ProxyTransformation: Preshaping Point Cloud Manifold With Proxy Attention For 3D Visual Grounding
PEACE: Empowering Geologic Map Holistic Understanding with MLLMs
Point2RBox-v2: Rethinking Point-supervised Oriented Object Detection with Spatial Layout Among Instances
Stable Flow: Vital Layers for Training-Free Image Editing
Charm: The Missing Piece in ViT Fine-Tuning for Image Aesthetic Assessment
Toward Generalized Image Quality Assessment: Relaxing the Perfect Reference Quality Assumption
HUNet: Homotopy Unfolding Network for Image Compressive Sensing
HyperGS: Hyperspectral 3D Gaussian Splatting
Joint Vision-Language Social Bias Removal for CLIP
Adaptive Non-Uniform Timestep Sampling for Accelerating Diffusion Model Training
SwiftEdit: Lightning Fast Text-Guided Image Editing via One-Step Diffusion
Any3DIS: Class-Agnostic 3D Instance Segmentation by 2D Mask Tracking
SharpDepth: Sharpening Metric Depth Predictions Using Diffusion Distillation
ODA-GAN: Orthogonal Decoupling Alignment GAN Assisted by Weakly-supervised Learning for Virtual Immunohistochemistry Staining
ECBench: Can Multi-modal Foundation Models Understand the Egocentric World? A Holistic Embodied Cognition Benchmark
RoadSocial: A Diverse VideoQA Dataset and Benchmark for Road Event Understanding from Social Video Narratives
Open-World Amodal Appearance Completion
Bringing CLIP to the Clinic: Dynamic Soft Labels and Negation-Aware Learning for Medical Analysis
ROICtrl: Boosting Instance Control for Visual Generation
Novel View Synthesis with Pixel-Space Diffusion Models
Parallel Sequence Modeling via Generalized Spatial Propagation Network
HybridMQA: Exploring Geometry-Texture Interactions for Colored Mesh Quality Assessment
Audio-Visual Semantic Graph Network for Audio-Visual Event Localization
DAGSM: Disentangled Avatar Generation with GS-enhanced Mesh
Vision-Guided Action: Enhancing 3D Human Motion Prediction with Gaze-informed Affordance in 3D Scenes
Open-Canopy: Towards Very High Resolution Forest Monitoring
Playing the Fool: Jailbreaking LLMs and Multimodal LLMs with Out-of-Distribution Strategy
SFDM: Robust Decomposition of Geometry and Reflectance for Realistic Face Rendering from Sparse-view Images
Is Your World Simulator a Good Story Presenter? A Consecutive Events-Based Benchmark for Future Long Video Generation
Handling Spatial-Temporal Data Heterogeneity for Federated Continual Learning via Tail Anchor
Robust Message Embedding via Attention Flow-Based Steganography
OpenMIBOOD: Open Medical Imaging Benchmarks for Out-Of-Distribution Detection
Towards High-fidelity 3D Talking Avatar with Personalized Dynamic Texture
TSAM: Temporal SAM Augmented with Multimodal Prompts for Referring Audio-Visual Segmentation
DORNet: A Degradation Oriented and Regularized Network for Blind Depth Super-Resolution
LEDiff: Latent Exposure Diffusion for HDR Generation
BIGS: Bimanual Category-agnostic Interaction Reconstruction from Monocular Videos via 3D Gaussian Splatting
Unified Medical Lesion Segmentation via Self-referring Indicator
iSegMan: Interactive Segment-and-Manipulate 3D Gaussians
MM-OR: A Large Multimodal Operating Room Dataset for Semantic Understanding of High-Intensity Surgical Environments
Towards Open-Vocabulary Audio-Visual Event Localization
GaPT-DAR: Category-level Garments Pose Tracking via Integrated 2D Deformation and 3D Reconstruction
Not All Parameters Matter: Masking Diffusion Models for Enhancing Generation Ability
Q-PART: Quasi-Periodic Adaptive Regression with Test-time Training for Pediatric Left Ventricular Ejection Fraction Regression
Similarity-Guided Layer-Adaptive Vision Transformer for UAV Tracking
Simplification Is All You Need against Out-of-Distribution Overconfidence
SOGS: Second-Order Anchor for Advanced 3D Gaussian Splatting
Enduring, Efficient and Robust Trajectory Prediction Attack in Autonomous Driving via Optimization-Driven Multi-Frame Perturbation Framework
3D Occupancy Prediction with Low-Resolution Queries via Prototype-aware View Transformation
Guiding Human-Object Interactions with Rich Geometry and Relations
Distribution Prototype Diffusion Learning for Open-set Supervised Anomaly Detection
GauCho: Gaussian Distributions with Cholesky Decomposition for Oriented Object Detection
SuperLightNet: Lightweight Parameter Aggregation Network for Multimodal Brain Tumor Segmentation
KeyFace: Expressive Audio-Driven Facial Animation for Long Sequences via KeyFrame Interpolation
A3: Few-shot Prompt Learning of Unlearnable Examples with Cross-Modal Adversarial Feature Alignment
Task-Agnostic Guided Feature Expansion for Class-Incremental Learning
Effortless Active Labeling for Long-Term Test-Time Adaptation
Learning Extremely High Density Crowds as Active Matters
3D Student Splatting and Scooping
Dynamic Updates for Language Adaptation in Visual-Language Tracking
Generative Hard Example Augmentation for Semantic Point Cloud Segmentation
Frequency Dynamic Convolution for Dense Image Prediction
Gazing Into Missteps: Leveraging Eye-Gaze for Unsupervised Mistake Detection in Egocentric Videos of Skilled Human Activities
4Deform: Neural Surface Deformation for Robust Shape Interpolation
Exploring the Deep Fusion of Large Language Models and Diffusion Transformers for Text-to-Image Synthesis
CORE4D: A 4D Human-Object-Human Interaction Dataset for Collaborative Object REarrangement
Gradient-Guided Annealing for Domain Generalization
ReCon: Enhancing True Correspondence Discrimination through Relation Consistency for Robust Noisy Correspondence Learning
Diffusion-4K: Ultra-High-Resolution Image Synthesis with Latent Diffusion Models
Morpheus: Text-Driven 3D Gaussian Splat Shape and Color Stylization
Towards Training-free Anomaly Detection with Vision and Language Foundation Models
Detecting Open World Objects via Partial Attribute Assignment
EmoDubber: Towards High Quality and Emotion Controllable Movie Dubbing
Repurposing Stable Diffusion Attention for Training-Free Unsupervised Interactive Segmentation
Exploring CLIP's Dense Knowledge for Weakly Supervised Semantic Segmentation
Lifting the Veil on Visual Information Flow in MLLMs: Unlocking Pathways to Faster Inference
Dual-view X-ray Detection: Can AI Detect Prohibited Items from Dual-view X-ray Images like Humans?
ClearSight: Visual Signal Enhancement for Object Hallucination Mitigation in Multimodal Large Language Models
WeatherGen: A Unified Diverse Weather Generator for LiDAR Point Clouds via Spider Mamba Diffusion
SpatialLLM: A Compound 3D-Informed Design towards Spatially-Intelligent Large Multimodal Models
Unleashing the Potential of Multi-modal Foundation Models and Video Diffusion for 4D Dynamic Physical Scene Simulation
GliaNet: Adaptive Neural Network Structure Learning with Glia-Driven
Improving Visual and Downstream Performance of Low-Light Enhancer with Vision Foundation Models Collaboration
Video-ColBERT: Contextualized Late Interaction for Text-to-Video Retrieval
BizGen: Advancing Article-level Visual Text Rendering for Infographics Generation
SlideChat: A Large Vision-Language Assistant for Whole-Slide Pathology Image Understanding
Learning Heterogeneous Tissues with Mixture of Experts for Gigapixel Whole Slide Images
MV-SSM: Multi-View State Space Modeling for 3D Human Pose Estimation
ZoomLDM: Latent Diffusion Model for Multi-scale Image Generation
Segment Anything, Even Occluded
Focus-N-Fix: Region-Aware Fine-Tuning for Text-to-Image Generation
Volumetrically Consistent 3D Gaussian Rasterization
Adaptive Rectangular Convolution for Remote Sensing Pansharpening
Weakly Supervised Temporal Action Localization via Dual-Prior Collaborative Learning Guided by Multimodal Large Language Models
TokenFlow: Unified Image Tokenizer for Multimodal Understanding and Generation
Single Domain Generalization for Few-Shot Counting via Universal Representation Matching
FADA: Fast Diffusion Avatar Synthesis with Mixed-Supervised Multi-CFG Distillation
Vision-Language Model IP Protection via Prompt-based Learning
DIFFER: Disentangling Identity Features via Semantic Cues for Clothes-Changing Person Re-ID
Video-XL: Extra-Long Vision Language Model for Hour-Scale Video Understanding
MobilePortrait: Real-Time One-Shot Neural Head Avatars on Mobile Devices
UNOPose: Unseen Object Pose Estimation with an Unposed RGB-D Reference Image
Real-time Free-view Human Rendering from Sparse-view RGB Videos using Double Unprojected Textures
SLADE: Shielding against Dual Exploits in Large Vision-Language Models
CSC-PA: Cross-image Semantic Correlation via Prototype Attentions for Single-network Semi-supervised Breast Tumor Segmentation
Acc3D: Accelerating Single Image to 3D Diffusion Models via Edge Consistency Guided Score Distillation
Generative Omnimatte: Learning to Decompose Video into Layers
Period-LLM: Extending the Periodic Capability of Multimodal Large Language Model
SketchVideo: Sketch-based Video Generation and Editing
Do Your Best and Get Enough Rest for Continual Learning
GenPC: Zero-shot Point Cloud Completion via 3D Generative Priors
From Sparse to Dense: Camera Relocalization with Scene-Specific Detector from Feature Gaussian Splatting
EchoONE: Segmenting Multiple Echocardiography Planes in One Model
EchoMimicV2: Towards Striking, Simplified, and Semi-Body Human Animation
Scaling Mesh Generation via Compressive Tokenization
FireEdit: Fine-grained Instruction-based Image Editing via Region-aware Vision Language Model
Rectification-specific Supervision and Constrained Estimator for Online Stereo Rectification
Beyond Image Classification: A Video Benchmark and Dual-Branch Hybrid Discrimination Framework for Compositional Zero-Shot Learning
Convex Combination Star Shape Prior for Data-driven Image Semantic Segmentation
PIAD: Pose and Illumination agnostic Anomaly Detection
Hyperbolic Uncertainty-Aware Few-Shot Incremental Point Cloud Segmentation
DropGaussian: Structural Regularization for Sparse-view Gaussian Splatting
GUI-Xplore: Empowering Generalizable GUI Agents with One Exploration
UniPose: A Unified Multimodal Framework for Human Pose Comprehension, Generation and Editing
Asynchronous Collaborative Graph Representation for Frames and Events
Explicit Depth-Aware Blurry Video Frame Interpolation Guided by Differential Curves
Are Spatial-Temporal Graph Convolution Networks for Human Action Recognition Over-Parameterized?
A Stitch in Time Saves Nine: Small VLM is a Precise Guidance for Accelerating Large VLMs
Rethinking Temporal Fusion with a Unified Gradient Descent View for 3D Semantic Occupancy Prediction
High Temporal Consistency through Semantic Similarity Propagation in Semi-Supervised Video Semantic Segmentation for Autonomous Flight
Decoupling Fine Detail and Global Geometry for Compressed Depth Map Super-Resolution
LAL: Enhancing 3D Human Motion Prediction with Latency-aware Auxiliary Learning
CoSER: Towards Consistent Dense Multiview Text-to-Image Generator for 3D Creation
CamPoint: Boosting Point Cloud Segmentation with Virtual Camera
MonoSplat: Generalizable 3D Gaussian Splatting from Monocular Depth Foundation Models
Collaborative Tree Search for Enhancing Embodied Multi-Agent Collaboration
DeepLA-Net: Very Deep Local Aggregation Networks for Point Cloud Analysis
HotSpot: Signed Distance Function Optimization with an Asymptotically Sufficient Condition
Three Cars Approaching within 100m! Enhancing Distant Geometry by Tri-Axis Voxel Scanning for Camera-based Semantic Scene Completion
Continuous Locomotive Crowd Behavior Generation
CountLLM: Towards Generalizable Repetitive Action Counting via Large Language Model
Masking meets Supervision: A Strong Learning Alliance
Generalizable Object Keypoint Localization from Generative Priors
EdgeDiff: Edge-aware Diffusion Network for Building Reconstruction from Point Clouds
REWIND: Real-Time Egocentric Whole-Body Motion Diffusion with Exemplar-Based Identity Conditioning
SegEarth-OV: Towards Training-Free Open-Vocabulary Segmentation for Remote Sensing Images
Free-viewpoint Human Animation with Pose-correlated Reference Selection
Invisible Backdoor Attack against Self-supervised Learning
Dynamic Integration of Task-Specific Adapters for Class Incremental Learning
QuCOOP: A Versatile Framework for Solving Composite and Binary-Parametrised Problems on Quantum Annealers
High-fidelity 3D Object Generation from Single Image with RGBN-Volume Gaussian Reconstruction Model
CorrBEV: Multi-View 3D Object Detection by Correlation Learning with Multi-modal Prototypes
WISNet: Pseudo Label Generation on Unbalanced and Patch Annotated Waste Images
Mitigating Ambiguities in 3D Classification with Gaussian Splatting
An Image-like Diffusion Method for Human-Object Interaction Detection
A Dataset for Semantic Segmentation in the Presence of Unknowns
Neuro-3D: Towards 3D Visual Decoding from EEG Signals
GazeGene: Large-scale Synthetic Gaze Dataset with 3D Eyeball Annotations
Uncertain Multimodal Intention and Emotion Understanding in the Wild
Leveraging Global Stereo Consistency for Category-Level Shape and 6D Pose Estimation from Stereo Images
MP-GUI: Modality Perception with MLLMs for GUI Understanding
Synergizing Motion and Appearance: Multi-Scale Compensatory Codebooks for Talking Head Video Generation
Order-Robust Class Incremental Learning: Graph-Driven Dynamic Similarity Grouping
CGMatch: A Different Perspective of Semi-supervised Learning
Mono-InternVL: Pushing the Boundaries of Monolithic Multimodal Large Language Models with Endogenous Visual Pre-training
DViN: Dynamic Visual Routing Network for Weakly Supervised Referring Expression Comprehension
WeakMCN: Multi-task Collaborative Network for Weakly Supervised Referring Expression Comprehension and Segmentation
HumanMM: Global Human Motion Recovery from Multi-shot Videos
Towards Stable and Storage-efficient Dataset Distillation: Matching Convexified Trajectory
SCSA: A Plug-and-Play Semantic Continuous-Sparse Attention for Arbitrary Semantic Style Transfer
PreciseCam: Precise Camera Control for Text-to-Image Generation
Generalized Diffusion Detector: Mining Robust Features from Diffusion Models for Domain-Generalized Detection
Learning from Synchronization: Self-Supervised Uncalibrated Multi-View Person Association in Challenging Scenes
Self-supervised ControlNet with Spatio-Temporal Mamba for Real-world Video Super-resolution
GaussHDR: High Dynamic Range Gaussian Splatting via Learning Unified 3D and 2D Local Tone Mapping
Image Quality Assessment: Investigating Causal Perceptual Effects with Abductive Counterfactual Inference
Perceptual Video Compression with Neural Wrapping
Unified Uncertainty-Aware Diffusion for Multi-Agent Trajectory Modeling
Samba: A Unified Mamba-based Framework for General Salient Object Detection
EBS-EKF: Accurate and High Frequency Event-based Star Tracking
MATCHA: Towards Matching Anything
FedAWA: Adaptive Optimization of Aggregation Weights in Federated Learning Using Client Vectors
CASAGPT: Cuboid Arrangement and Scene Assembly for Interior Design
Do We Always Need the Simplicity Bias? Looking for Optimal Inductive Biases in the Wild
High-Fidelity Lightweight Mesh Reconstruction from Point Clouds
BOOTPLACE: Bootstrapped Object Placement with Detection Transformers
SpatialDreamer: Self-supervised Stereo Video Synthesis from Monocular Input
Layer- and Timestep-Adaptive Differentiable Token Compression Ratios for Efficient Diffusion Transformers
Circumventing Shortcuts in Audio-visual Deepfake Detection Datasets with Unsupervised Learning
Less is More: Efficient Image Vectorization with Adaptive Parameterization
Dual Consolidation for Pre-Trained Model-Based Domain-Incremental Learning
SceneFactor: Factored Latent 3D Diffusion for Controllable 3D Scene Generation
Unveiling the Mist over 3D Vision-Language Understanding: Object-centric Evaluation with Chain-of-Analysis
Using Powerful Prior Knowledge of Diffusion Model in Deep Unfolding Networks for Image Compressive Sensing
Extrapolating and Decoupling Image-to-Video Generation Models: Motion Modeling is Easier Than You Think
Large-scale Multi-view Tensor Clustering with Implicit Linear Kernels
NTR-Gaussian: Nighttime Dynamic Thermal Reconstruction with 4D Gaussian Splatting Based on Thermodynamics
Tiled Diffusion
S^3-Face: SSS-Compliant Facial Reflectance Estimation via Diffusion Priors
Domain Adaptive Diabetic Retinopathy Grading with Model Absence and Flowing Data
Continuous, Subject-Specific Attribute Control in T2I Models by Identifying Semantic Directions
CleanDIFT: Diffusion Features without Noise
SmartCLIP: Modular Vision-language Alignment with Identification Guarantees
Adversarial Diffusion Compression for Real-World Image Super-Resolution
From Sparse Signal to Smooth Motion: Real-Time Motion Generation with Rolling Prediction Models
MixerMDM: Learnable Composition of Human Motion Diffusion Models
Early-Bird Diffusion: Investigating and Leveraging Timestep-Aware Early-Bird Tickets in Diffusion Models for Efficient Training
Efficient ANN-Guided Distillation: Aligning Rate-based Features of Spiking Neural Networks through Hybrid Block-wise Replacement
Zero-1-to-A: Zero-Shot One Image to Animatable Head Avatars Using Video Diffusion
DUNE: Distilling a Universal Encoder from Heterogeneous 2D and 3D Teachers
Not Just Text: Uncovering Vision Modality Typographic Threats in Image Generation Models
DropoutGS: Dropping Out Gaussians for Better Sparse-view Rendering
Multi-subject Open-set Personalization in Video Generation
Distinguish Then Exploit: Source-free Open Set Domain Adaptation via Weight Barcode Estimation and Sparse Label Assignment
DIFIX3D+: Improving 3D Reconstructions with Single-Step Diffusion Models
GEN3C: 3D-Informed World-Consistent Video Generation with Precise Camera Control
Diffusion Renderer: Neural Inverse and Forward Rendering with Video Diffusion Models
CADRef: Robust Out-of-Distribution Detection via Class-Aware Decoupled Relative Feature Leveraging
STAR-Edge: Structure-aware Local Spherical Curve Representation for Thin-walled Edge Extraction from Unstructured Point Clouds
WF-VAE: Enhancing Video VAE by Wavelet-Driven Energy Flow for Latent Video Diffusion Model
SmartEraser: Remove Anything from Images using Masked-Region Guidance
Automatic Joint Structured Pruning and Quantization for Efficient Neural Network Training and Compression
TAMT: Temporal-Aware Model Tuning for Cross-Domain Few-Shot Action Recognition
LION-FS: Fast & Slow Video-Language Thinker as Online Video Assistant
MegaSaM: Accurate, Fast and Robust Structure and Motion from Casual Dynamic Videos
Stereo4D: Learning How Things Move in 3D from Internet Stereo Videos
DiSRT-In-Bed: Diffusion-Based Sim-to-Real Transfer Framework for In-Bed Human Mesh Recovery
Sample- and Parameter-Efficient Auto-Regressive Image Models
LongDiff: Training-Free Long Video Generation in One Go
LLaVA-Critic: Learning to Evaluate Multimodal Models
Motion Prompting: Controlling Video Generation with Motion Trajectories
EigenGS Representation: From Eigenspace to Gaussian Image Space
AC3D: Analyzing and Improving 3D Camera Control in Video Diffusion Transformers
Panorama Generation From NFoV Image Done Right
MERGE: Multi-faceted Hierarchical Graph-based GNN for Gene Expression Prediction from Whole Slide Histopathology Images
Edit Away and My Face Will not Stay: Personal Biometric Defense against Malicious Generative Editing
Ego4o: Egocentric Human Motion Capture and Understanding from Multi-Modal Input
Sketch Down the FLOPs: Towards Efficient Networks for Human Sketch
Ges3ViG: Incorporating Pointing Gestures into Language-Based 3D Visual Grounding for Embodied Reference Understanding
Unveiling Differences in Generative Models: A Scalable Differential Clustering Approach
DIO: Decomposable Implicit 4D Occupancy-Flow World Model
Assessing and Learning Alignment of Unimodal Vision and Language Models
Temporally Consistent Object-Centric Learning by Contrasting Slots
Goku: Flow Based Video Generative Foundation Models
Spectral Informed Mamba for Robust Point Cloud Processing
UniReal: Universal Image Generation and Editing via Learning Real-world Dynamics
Multitwine: Multi-Object Compositing with Text and Layout Control
MetaShadow: Object-Centered Shadow Detection, Removal, and Synthesis
Detecting Out-of-Distribution Through the Lens of Neural Collapse
Any-Resolution AI-Generated Image Detection by Spectral Learning
Self-Cross Diffusion Guidance for Text-to-Image Synthesis of Similar Subjects
ASHiTA: Automatic Scene-grounded HIerarchical Task Analysis
Efficient Depth Estimation for Unstable Stereo Camera Systems on AR Glasses
Spiking Transformer: Introducing Accurate Addition-Only Spiking Self-Attention for Transformer
Infighting in the Dark: Multi-Label Backdoor Attack in Federated Learning
GeoAvatar: Geometrically-Consistent Multi-Person Avatar Reconstruction from Sparse Multi-View Videos
ReDiffDet: Rotation-equivariant Diffusion Model for Oriented Object Detection
Color Alignment in Diffusion
JamMa: Ultra-lightweight Local Feature Matching with Joint Mamba
PhyT2V: LLM-Guided Iterative Self-Refinement for Physics-Grounded Text-to-Video Generation
VidComposition: Can MLLMs Analyze Compositions in Compiled Videos?
DeCLIP: Decoupled Learning for Open-Vocabulary Dense Perception
ORIDa: Object-centric Real-world Image Composition Dataset
Scalable Video-to-Dataset Generation for Cross-Platform Mobile Agents
SimLTD: Simple Supervised and Semi-Supervised Long-Tailed Object Detection
NLPrompt: Noise-Label Prompt Learning for Vision-Language Models
Grounding 3D Object Affordance with Language Instructions, Visual Observations and Interactions
Towards Long-Horizon Vision-Language Navigation: Platform, Benchmark and Method
CMMLoc: Advancing Text-to-PointCloud Localization with Cauchy-Mixture-Model Based Framework
A Polarization-Aided Transformer for Image Deblurring via Motion Vector Decomposition
Seek Common Ground While Reserving Differences: Semi-Supervised Image-Text Sentiment Recognition
Omni-Scene: Omni-Gaussian Representation for Ego-Centric Sparse-View Scene Reconstruction
Towards Fine-Grained Interpretability: Counterfactual Explanations for Misclassification with Saliency Partition
Search and Detect: Training-Free Long Tail Object Detection via Web-Image Retrieval
Harnessing Global-Local Collaborative Adversarial Perturbation for Anti-Customization
ReconDreamer: Crafting World Models for Driving Scene Reconstruction via Online Restoration
DriveDreamer4D: World Models Are Effective Data Machines for 4D Driving Scene Representation
HumanDreamer: Generating Controllable Human-Motion Videos via Decoupled Generation
Change3D: Revisiting Change Detection and Captioning from A Video Modeling Perspective
A Simple yet Effective Layout Token in Large Language Models for Document Understanding
HeatFormer: A Neural Optimizer for Multiview Human Mesh Recovery
MAtCha Gaussians: Atlas of Charts for High-Quality Geometry and Photorealism From Sparse Views
Dual-Interrelated Diffusion Model for Few-Shot Anomaly Image Generation
Unveil Inversion and Invariance in Flow Transformer for Versatile Image Editing
Universal Actions for Enhanced Embodied Foundation Models
How Do I Do That? Synthesizing 3D Hand Motion and Contacts for Everyday Interactions
Unsupervised Discovery of Facial Landmarks and Head Pose
On the Consistency of Video Large Language Models in Temporal Comprehension
3DEnhancer: Consistent Multi-View Diffusion for 3D Enhancement
Maintaining Consistent Inter-Class Topology in Continual Test-Time Adaptation
DiffSensei: Bridging Multi-Modal LLMs and Diffusion Models for Customized Manga Generation
Effective SAM Combination for Open-Vocabulary Semantic Segmentation
Exploring Intrinsic Normal Prototypes within a Single Image for Universal Anomaly Detection
GroundingFace: Fine-grained Face Understanding via Pixel Grounding Multimodal Large Language Model
Sonic: Shifting Focus to Global Audio Perception in Portrait Animation
SVFR: A Unified Framework for Generalized Video Face Restoration
Hunyuan-Portrait: Implicit Condition Control for Enhanced Portrait Animation
Take the Bull by the Horns: Learning to Segment Hard Samples
DSV-LFS: Unifying LLM-Driven Semantic Cues with Visual Features for Robust Few-Shot Segmentation
Understanding Multi-layered Transmission Matrices
Self-Supervised Spatial Correspondence Across Modalities
Bridge the Gap: From Weak to Full Supervision for Temporal Action Localization with PseudoFormer
Percept, Memory, and Imagine: World Feature Simulating for Open-Domain Unknown Object Detection
Token Cropr: Faster ViTs for Quite a Few Tasks
IAAO: Interactive Affordance Learning for Articulated Objects in 3D Environments
DFormerv2: Geometry Self-Attention for RGBD Semantic Segmentation
DiffFNO: Diffusion Fourier Neural Operator
Matrix-Free Shared Intrinsics Bundle Adjustment
End-to-End Implicit Neural Representations for Classification
Online Task-Free Continual Learning via Dynamic Expansionable Memory Distribution
DiET-GS: Diffusion Prior and Event Stream-Assisted Motion Deblurring 3D Gaussian Splatting
ViKIENet: Towards Efficient 3D Object Detection with Virtual Key Instance Enhanced Network
Sea-ing in Low-light
DynaMoDe-NeRF: Motion-aware Deblurring Neural Radiance Field for Dynamic Scenes
GPS as a Control Signal for Image Generation
Dynamic Content Prediction with Motion-aware Priors for Blind Face Video Restoration
Seeing is Not Believing: Adversarial Natural Object Optimization for Hard-Label 3D Scene Attacks
Once-Tuning-Multiple-Variants: Tuning Once and Expanded as Multiple Vision-Language Model Variants
Occlusion-aware Text-Image-Point Cloud Pretraining for Open-World 3D Object Recognition
On the Generalization of Handwritten Text Recognition Models
SAMWISE: Infusing Wisdom in SAM2 for Text-Driven Video Segmentation
Doppelgängers and Adversarial Vulnerability
Adapting to Observation Length of Trajectory Prediction via Contrastive Learning
Dynamic Stereotype Theory Induced Micro-expression Recognition with Oriented Deformation
LC-Mamba: Local and Continuous Mamba with Shifted Windows for Frame Interpolation
Shading Meets Motion: Self-supervised Indoor 3D Reconstruction Via Simultaneous Shape-from-Shading and Structure-from-Motion
Difference Inversion: Interpolate and Isolate the Difference with Token Consistency for Image Analogy Generation
Odd-One-Out: Anomaly Detection by Comparing with Neighbors
DirectTriGS: Triplane-based Gaussian Splatting Field Representation for 3D Generation
4DTAM: Non-Rigid Tracking and Mapping via Dynamic Surface Gaussians
Commonsense Video Question Answering through Video-Grounded Entailment Tree Reasoning
Rethinking Epistemic and Aleatoric Uncertainty for Active Open-Set Annotation: An Energy-Based Approach
Steering Away from Harm: An Adaptive Approach to Defending Vision Language Model Against Jailbreaks
OffsetOPT: Explicit Surface Reconstruction without Normals
Dual Energy-Based Model with Open-World Uncertainty Estimation for Out-of-distribution Detection
FIFA: Fine-grained Inter-frame Attention for Driver's Video Gaze Estimation
Sketchy Bounding-box Supervision for 3D Instance Segmentation
MuTri: Multi-view Tri-alignment for OCT to OCTA 3D Image Translation
DynScene: Scalable Generation of Dynamic Robotic Manipulation Scenes for Embodied AI
SVG-IR: Spatially-Varying Gaussian Splatting for Inverse Rendering
PERSE: Personalized 3D Generative Avatars from A Single Portrait
Improving Editability in Image Generation with Layer-wise Memory
CDI: Copyrighted Data Identification in Diffusion Models
Devil is in the Detail: Towards Injecting Fine Details of Image Prompt in Image Generation via Conflict-free Guidance and Stratified Attention
Camera Resection from Known Line Pencils and a Radially Distorted Scanline
Community Forensics: Using Thousands of Generators to Train Fake Image Detectors
Hypergraph Vision Transformers: Images are More than Nodes, More than Edges
Enhancing Creative Generation on Stable Diffusion-based Models
Noise Modeling in One Hour: Minimizing Preparation Efforts for Self-supervised Low-Light RAW Image Denoising
SfM-Free 3D Gaussian Splatting via Hierarchical Training
Heterogeneous Skeleton-Based Action Representation Learning
Towards Realistic Example-based Modeling via 3D Gaussian Stitching
DTGBrepGen: A Novel B-rep Generative Model through Decoupling Topology and Geometry
Navigating the Unseen: Zero-shot Scene Graph Generation via Capsule-Based Equivariant Features
Point Cloud Upsampling Using Conditional Diffusion Module with Adaptive Noise Suppression
MambaVision: A Hybrid Mamba-Transformer Vision Backbone
Beyond Clean Training Data: A Versatile and Model-Agnostic Framework for Out-of-Distribution Detection with Contaminated Training Data
ColabSfM: Collaborative Structure-from-Motion by Point Cloud Registration
Correcting Deviations from Normality: A Reformulated Diffusion Model for Multi-Class Unsupervised Anomaly Detection
VL2Lite: Task-Specific Knowledge Distillation from Large Vision-Language Models to Lightweight Networks
Improving Personalized Search with Regularized Low-Rank Parameter Updates
Vision-Language Embodiment for Monocular Depth Estimation
Conformal Prediction for Zero-Shot Models
Auto-Encoded Supervision for Perceptual Image Super-Resolution
Cubify Anything: Scaling Indoor 3D Object Detection
CTRL-D: Controllable Dynamic 3D Scene Editing with Personalized 2D Diffusion
What Makes a Good Dataset for Knowledge Distillation?
PCM: Picard Consistency Model for Fast Parallel Sampling of Diffusion Models
EmoEdit: Evoking Emotions through Image Manipulation
Multiple Object Tracking as ID Prediction
ICE: Intrinsic Concept Extraction from a Single Image via Diffusion Models
Mr. DETR: Instructive Multi-Route Training for Detection Transformers
v-CLR: View-Consistent Learning for Open-World Instance Segmentation
Lifting Motion to the 3D World via 2D Diffusion
Scale Efficient Training for Large Datasets
Recurrent Feature Mining and Keypoint Mixup Padding for Category-Agnostic Pose Estimation
Seeing Speech and Sound: Distinguishing and Locating Audio Sources in Visual Scenes
From Faces to Voices: Learning Hierarchical Representations for High-quality Video-to-Speech
Classifier-guided CLIP Distillation for Unsupervised Multi-label Classification
The Art of Deception: Color Visual Illusions and Diffusion Models
Minimizing Labeled, Maximizing Unlabeled: An Image-Driven Approach for Video Instance Segmentation
BF-STVSR: B-Splines and Fourier---Best Friends for High Fidelity Spatial-Temporal Video Super-Resolution
SemanticDraw: Towards Real-Time Interactive Content Creation from Image Diffusion Models
AutoURDF: Unsupervised Robot Modeling from Point Cloud Frames Using Cluster Registration
Do Visual Imaginations Improve Vision-and-Language Navigation Agents?
MoSca: Dynamic Gaussian Fusion from Casual Videos via 4D Motion Scaffolds
OmniFlow: Any-to-Any Generation with Multi-Modal Rectified Flows
SMILE: Infusing Spatial and Motion Semantics in Masked Video Learning
Video Summarization with Large Language Models
Consistent Normal Orientation for 3D Point Clouds via Least Squares on Delaunay Graph
Zero-shot RGB-D Point Cloud Registration with Pre-trained Large Vision Model
Exploring Semantic Feature Discrimination for Perceptual Image Super-Resolution and Opinion-Unaware No-Reference Image Quality Assessment
SLVR: Super-Light Visual Reconstruction via Blueprint Controllable Convolutions and Exploring Feature Diversity Representation
Hazy Low-Quality Satellite Video Restoration Via Learning Optimal Joint Degradation Patterns and Continuous-Scale Super-Resolution Reconstruction
MUST: The First Dataset and Unified Framework for Multispectral UAV Single Object Tracking
Language-Guided Image Tokenization for Generation
MaSS13K: A Matting-level Semantic Segmentation Benchmark
Self-Expansion of Pre-trained Models with Mixture of Adapters for Continual Learning
PRaDA: Projective Radial Distortion Averaging
GEAL: Generalizable 3D Affordance Learning with Cross-Modal Consistency
TSP-Mamba: The Travelling Salesman Problem Meets Mamba for Image Super-resolution and Beyond
IncEventGS: Pose-Free Gaussian Splatting from a Single Event Camera
Hybrid Global-Local Representation with Augmented Spatial Guidance for Zero-Shot Referring Image Segmentation
LoRA Subtraction for Drift-Resistant Space in Exemplar-Free Continual Learning
Erase Diffusion: Empowering Object Removal Through Calibrating Diffusion Pathways
Uncertainty Weighted Gradients for Model Calibration
GaussianUDF: Inferring Unsigned Distance Functions through 3D Gaussian Splatting
MAP: Unleashing Hybrid Mamba-Transformer Vision Backbone's Potential with Masked Autoregressive Pretraining
Joint Scheduling of Causal Prompts and Tasks for Multi-Task Learning
UniAP: Unifying Inter- and Intra-Layer Automatic Parallelism by Mixed Integer Quadratic Programming
EgoTextVQA: Towards Egocentric Scene-Text Aware Video Question Answering
Learning Textual Prompts for Open-World Semi-Supervised Learning
RICCARDO: Radar Hit Prediction and Convolution for Camera-Radar 3D Object Detection
Identifying and Mitigating Spurious Correlation in Multi-Task Learning
Chapter-Llama: Efficient Chaptering in Hour-Long Videos with LLMs
Relation3D: Enhancing Relation Modeling for Point Cloud Instance Segmentation
Black-Box Forgery Attacks on Semantic Watermarks for Diffusion Models
Style Quantization for Data-Efficient GAN Training
DA-VPT: Semantic-Guided Visual Prompt Tuning for Vision Transformers
Closest Neighbors are Harmful for Lightweight Masked Auto-encoders
Generative Photomontage
RNG: Relightable Neural Gaussians
DejaVid: Encoder-Agnostic Learned Temporal Matching for Video Classification
Can Large Vision-Language Models Correct Semantic Grounding Errors By Themselves?
Compositional Targeted Multi-Label Universal Perturbations
Customized Condition Controllable Generation for Video Soundtrack
FASTer: Focal token Acquiring-and-Scaling Transformer for Long-term 3D Object Detection
SDGOCC: Semantic and Depth-Guided Bird's-Eye View Transformation for 3D Multimodal Occupancy Prediction
Be More Specific: Evaluating Object-centric Realism in Synthetic Images
Cheb-GR: Rethinking K-nearest Neighbor Search in Re-ranking for Person Re-identification
The Impact Label Noise and Choice of Threshold has on Cross-Entropy and Soft-Dice in Image Segmentation
DarkIR: Robust Low-Light Image Restoration
FIction: 4D Future Interaction Prediction from Video
V^2Dial: Unification of Video and Visual Dialog via Multimodal Experts
PhysicsGen: Can Generative Models Learn from Images to Predict Complex Physical Relations?
Improved Monocular Depth Prediction Using Distance Transform Over Pre-semantic Contours with Self-supervised Neural Networks
SplatAD: Real-Time Lidar and Camera Rendering with 3D Gaussian Splatting for Autonomous Driving
A Distractor-Aware Memory for Visual Object Tracking with SAM2
Compositional Caching for Training-free Open-vocabulary Attribute Detection
Zero-Shot Blind-spot Image Denoising via Implicit Neural Sampling
Fingerprinting Denoising Diffusion Probabilistic Models
LotusFilter: Fast Diverse Nearest Neighbor Search via a Learned Cutoff Table
Realistic Test-Time Adaptation of Vision-Language Models
MODA: Motion-Drift Augmentation for Inertial Human Motion Analysis
Saliuitl: Ensemble Salience Guided Recovery of Adversarial Patches against CNNs
VideoComp: Advancing Fine-Grained Compositional and Temporal Alignment in Video-Text Models
PURA: Parameter Update-Recovery Test-Time Adaption for RGB-T Tracking
Towards Generalizable Trajectory Prediction using Dual-Level Representation Learning and Adaptive Prompting
VLog: Video-Language Models by Generative Retrieval of Narration Vocabulary
Rethinking Few-Shot Adaptation of Vision-Language Models in Two Stages
Personalized Preference Fine-tuning of Diffusion Models
Feature-Preserving Mesh Decimation for Normal Integration
SAM2Object: Consolidating View Consistency via SAM2 for Zero-Shot 3D Instance Segmentation
Seeking Consistent Flat Minima for Better Domain Generalization via Refining Loss Landscapes
Free Lunch Enhancements for Multi-modal Crowd Counting
EquiPose: Exploiting Permutation Equivariance for Relative Camera Pose Estimation
Exploring Timeline Control for Facial Motion Generation
Ev-3DOD: Pushing the Temporal Boundaries of 3D Object Detection with Event Cameras
DiffLO: Semantic-Aware LiDAR Odometry with Diffusion-Based Refinement
LiveCC: Learning Video LLM with Streaming Speech Transcription at Scale
TADFormer: Task-Adaptive Dynamic TransFormer for Efficient Multi-Task Learning
Diff2Flow: Training Flow Matching Models via Diffusion Model Alignment
HalLoc: Token-level Localization of Hallucinations for Vision Language Models
SAIST: Segment Any Infrared Small Target Model Guided by Contrastive Language-Image Pretraining
U-Know-DiffPAN: An Uncertainty-aware Knowledge Distillation Diffusion Framework with Details Enhancement for PAN-Sharpening
SplineGS: Robust Motion-Adaptive Spline for Real-Time Dynamic 3D Gaussians from Monocular Video
Reanimating Images using Neural Representations of Dynamic Stimuli
VideoAutoArena: An Automated Arena for Evaluating Large Multimodal Models in Video Analysis through User Simulation
AnyMap: Learning a General Camera Model for Structure-from-Motion with Unknown Distortion in Dynamic Scenes
SGSST: Scaling Gaussian Splatting Style Transfer
FineLIP: Extending CLIP’s Reach via Fine-Grained Alignment with Longer Text Inputs
UNEM: UNrolled Generalized EM for Transductive Few-Shot Learning
HyperNVD: Accelerating Neural Video Decomposition via Hypernetworks
Neural Inverse Rendering from Propagating Light
Leveraging 3D Geometric Priors in 2D Rotation Symmetry Detection
3D-HGS: 3D Half-Gaussian Splatting
TASTE-Rob: Advancing Video Generation of Task-Oriented Hand-Object Interaction for Generalizable Robotic Manipulation
Detect-and-Guide: Self-regulation of Diffusion Models for Safe Text-to-Image Generation via Guideline Token Optimization
Creating Your Editable 3D Photorealistic Avatar with Tetrahedron-constrained Gaussian Splatting
BioX-CPath: Biologically-driven Explainable Diagnostics for Multistain IHC Computational Pathology
FreqDebias: Towards Generalizable Deepfake Detection via Consistency-Driven Frequency Debiasing
Optical-Flow Guided Prompt Optimization for Coherent Video Generation
ViiNeuS: Volumetric Initialization for Implicit Neural Surface Reconstruction of Urban Scenes with Limited Image Overlap
Plug-and-Play Interpretable Responsible Text-to-Image Generation via Dual-Space Multi-facet Concept Control
The Illusion of Unlearning: The Unstable Nature of Machine Unlearning in Text-to-Image Diffusion Models
FirePlace: Geometric Refinements of LLM Common Sense Reasoning for 3D Object Placement
Learning to Filter Outlier Edges in Global SfM
Enhancing Virtual Try-On with Synthetic Pairs and Error-Aware Noise Scheduling
Generative Modeling of Class Probability for Multi-Modal Representation Learning
Bootstrap Your Own Views: Masked Ego-Exo Modeling for Fine-grained View-invariant Video Representations
Augmenting Multimodal LLMs with Self-Reflective Tokens for Knowledge-based Visual Question Answering
RefPose: Leveraging Reference Geometric Correspondences for Accurate 6D Pose Estimation of Unseen Objects
Unlocking the Potential of Unlabeled Data in Semi-Supervised Domain Generalization
AnySat: One Earth Observation Model for Many Resolutions, Scales, and Modalities
Boost Your Human Image Generation Model via Direct Preference Optimization
Curriculum Coarse-to-Fine Selection for High-IPC Dataset Distillation
It’s a (Blind) Match! Towards Vision-Language Correspondence without Parallel Data
Multi-View Pose-Agnostic Change Localization with Zero Labels
Mind the Gap: Detecting Black-box Adversarial Attacks in the Making through Query Update Analysis
Track4Gen: Teaching Video Diffusion Models to Track Points Improves Video Generation
MultiMorph: On-demand Atlas Construction
Not Only Text: Exploring Compositionality of Visual Representations in Vision-Language Models
Learning Physics From Video: Unsupervised Physical Parameter Estimation for Continuous Dynamical Systems
DeClotH: Decomposable 3D Cloth and Human Body Reconstruction from a Single Image
DepthCues: Evaluating Monocular Depth Perception in Large Vision Models
MVSAnywhere: Zero-Shot Multi-View Stereo
Keyframe-Guided Creative Video Inpainting
Multi-Modal Contrastive Masked Autoencoders: A Two-Stage Progressive Pre-training Approach for RGBD Datasets
TailedCore: Few-Shot Sampling for Unsupervised Long-Tail Noisy Anomaly Detection
Conditional Balance: Improving Multi-Conditioning Trade-Offs in Image Generation
Graph-Based 3D Lane Detection from Monocular Images
GOAL: Global-local Object Alignment Learning
Order-One Rolling Shutter Cameras
Deterministic Certification of Graph Neural Networks against Graph Poisoning Attacks with Arbitrary Perturbations
Dual Focus-Attention Transformer for Robust Point Cloud Registration
Incorporating Dense Knowledge Alignment into Unified Multimodal Representation Models
VideoHandles: Editing 3D Object Compositions in Videos Using Video Generative Priors
A Tale of Two Classes: Adapting Supervised Contrastive Learning to Binary Imbalanced Datasets
Robust Multi-Object 4D Generation for In-the-wild Videos
Generalized Gaussian Entropy Model for Point Cloud Attribute Compression with Dynamic Likelihood Intervals
Descriptor-In-Pixel: Point-Feature Tracking For Pixel Processor Arrays
PosterO: Structuring Layout Trees to Enable Language Models in Generalized Content-Aware Layout Generation
Pioneering 4-Bit FP Quantization for Diffusion Models: Mixup-Sign Quantization and Timestep-Aware Fine-Tuning
Learning with Noisy Triplet Correspondence for Composed Image Retrieval
DKC: Differentiated Knowledge Consolidation for Cloth-Hybrid Lifelong Person Re-identification
Global-Local Tree Search in VLMs for 3D Indoor Scene Generation
Learning to Sample Effective and Diverse Prompts for Text-to-Image Generation
Universal Domain Adaptation for Semantic Segmentation
Masked Scene Modeling: Narrowing the Gap Between Supervised and Self-Supervised Learning in 3D Scene Understanding
SyncSDE: A Probabilistic Framework for Diffusion Synchronization
Towards Human-Understandable Multi-Dimensional Concept Discovery
Beyond Local Sharpness: Communication-Efficient Global Sharpness-aware Minimization for Federated Learning
Towards Understanding and Quantifying Uncertainty for Text-to-Image Generation
EfficientViM: Efficient Vision Mamba with Hidden State Mixer based State Space Duality
VidHalluc: Evaluating Temporal Hallucinations in Multimodal Large Language Models for Video Understanding
VideoGuide: Improving Video Diffusion Models without Training Through a Teacher's Guide
Unraveling Normal Anatomy via Fluid-Driven Anomaly Randomization
RADIOv2.5: Improved Baselines for Agglomerative Vision Foundation Models
Functionality Understanding and Segmentation in 3D Scenes
F^3OCUS - Federated Finetuning of Vision-Language Foundation Models with Optimal Client Layer Updating Strategy via Multi-objective Meta-Heuristics
ConText-CIR: Learning from Concepts in Text for Composed Image Retrieval
Generative Multiview Relighting for 3D Reconstruction under Extreme Illumination Variation
AniGrad: Anisotropic Gradient-Adaptive Sampling for 3D Reconstruction From Monocular Video
Enhancing 3D Gaze Estimation in the Wild using Weak Supervision with Gaze Following Labels
Towards More General Video-based Deepfake Detection through Facial Component Guided Adaptation for Foundation Model
ProbPose: A Probabilistic Approach to 2D Human Pose Estimation
SAM-I2V: Upgrading SAM to Support Promptable Video Segmentation with Less than 0.2% Training Cost
Integral Fast Fourier Color Constancy
CacheQuant: Comprehensively Accelerated Diffusion Models
Deterministic Image-to-Image Translation via Denoising Brownian Bridge Models with Dual Approximators
Semantic-guided Cross-Modal Prompt Learning for Skeleton-based Zero-shot Action Recognition
ONDA-Pose: Occlusion-Aware Neural Domain Adaptation for Self-Supervised 6D Object Pose Estimation
T2SG: Traffic Topology Scene Graph for Topology Reasoning in Autonomous Driving
Text Embedding is Not All You Need: Attention Control for Text-to-Image Semantic Alignment with Text Self-Attention Maps
Coeff-Tuning: A Graph Filter Subspace View for Tuning Attention-Based Large Models
ArcPro: Architectural Programs for Structured 3D Abstraction of Sparse Points
LT3SD: Latent Trees for 3D Scene Diffusion
Classifier-to-Bias: Toward Unsupervised Automatic Bias Detection for Visual Classifiers
EVPGS: Enhanced View Prior Guidance for Splatting-based Extrapolated View Synthesis
Explainable Saliency: Articulating Reasoning with Contextual Prioritization
EffiDec3D: An Optimized Decoder for High-Performance and Efficient 3D Medical Image Segmentation
CTRL-O: Language-Controllable Object-Centric Visual Representation Learning
ReRAW: RGB-to-RAW Image Reconstruction via Stratified Sampling for Efficient Object Detection on the Edge
PQPP: A Joint Benchmark for Text-to-Image Prompt and Query Performance Prediction
RipVIS: Rip Currents Video Instance Segmentation Benchmark for Beach Monitoring and Safety
Stop Walking in Circles! Bailing Out Early in Projected Gradient Descent
Geometry in Style: 3D Stylization via Surface Normal Deformation
Locally Orderless Images for Optimization in Differentiable Rendering
Binarized Neural Network for Multi-spectral Image Fusion
Sketchtopia: A Dataset and Foundational Agents for Benchmarking Asynchronous Multimodal Communication with Iconic Feedback
ArtFormer: Controllable Generation of Diverse 3D Articulated Objects
Extreme Rotation Estimation in the Wild
Condensing Action Segmentation Datasets via Generative Network Inversion
Reconstructing People, Places, and Cameras
LIM: Large Interpolator Model for Dynamic Reconstruction
Hand-held Object Reconstruction from RGB Video with Dynamic Interaction
DiTASK: Multi-Task Fine-Tuning with Diffeomorphic Transformations
LoKi: Low-dimensional KAN for Efficient Fine-tuning Image Models
Targeted Forgetting of Image Subgroups in CLIP Models
No Thing, Nothing: Highlighting Safety-Critical Classes for Robust LiDAR Semantic Segmentation in Adverse Weather
MASt3R-SLAM: Real-Time Dense SLAM with 3D Reconstruction Priors
DIV-FF: Dynamic Image-Video Feature Fields For Environment Understanding in Egocentric Videos
Data Distributional Properties As Inductive Bias for Systematic Generalization
ParaHome: Parameterizing Everyday Home Activities Towards 3D Generative Modeling of Human-Object Interactions
OmniSplat: Taming Feed-Forward 3D Gaussian Splatting for Omnidirectional Images with Editable Capabilities
Scenario Dreamer: Vectorized Latent Diffusion for Generating Driving Simulation Environments
PatchGuard: Adversarially Robust Anomaly Detection and Localization through Vision Transformers and Pseudo Anomalies
OSLoPrompt: Bridging Low-Supervision Challenges and Open-Set Domain Generalization in CLIP
ScribbleLight: Single Image Indoor Relighting with Scribbles
Hardware-Rasterized Ray-Based Gaussian Splatting
Fortifying Federated Learning Towards Trustworthiness via Auditable Data Valuation and Verifiable Client Contribution
Show and Tell: Visually Explainable Deep Neural Nets via Spatially-Aware Concept Bottleneck Models
Optimus-2: Multimodal Minecraft Agent with Goal-Observation-Action Conditioned Policy
Task-aware Cross-modal Feature Refinement Transformer with Large Language Models for Visual Grounding
A Universal Scale-Adaptive Deformable Transformer for Image Restoration across Diverse Artifacts
CAT4D: Create Anything in 4D with Multi-View Video Diffusion Models
SimVS: Simulating World Inconsistencies for Robust View Synthesis
PSBD: Prediction Shift Uncertainty Unlocks Backdoor Detection
Chebyshev Attention Depth Permutation Texture Network with Latent Texture Attribute Loss
Rethinking Decoder Design: Improving Biomarker Segmentation Using Depth-to-Space Restoration and Residual Linear Attention
Multi-Layer Visual Feature Fusion in Multimodal LLMs: Methods, Analysis, and Best Practices
Cross-Modal Distillation for 2D/3D Multi-Object Discovery from 2D Motion
Locality-Aware Zero-Shot Human-Object Interaction Detection
ShowMak3r: Compositional TV Show Reconstruction
Spectral State Space Model for Rotation-Invariant Visual Representation Learning
AdMiT: Adaptive Multi-Source Tuning in Dynamic Environments
Efficient Event-Based Object Detection: A Hybrid Neural Network with Spatial and Temporal Attention
Multi-Resolution Pathology-Language Pre-training Model with Text-Guided Visual Representation
Denoising Functional Maps: Diffusion Models for Shape Correspondence
Simpler Diffusion: 1.5 FID on ImageNet512 with Pixel-space Diffusion
HELVIPAD: A Real-World Dataset for Omnidirectional Stereo Depth Estimation
EarthDial: Turning Multi-sensory Earth Observations to Interactive Dialogues
LiSu: A Dataset and Method for LiDAR Surface Normal Estimation
GBlobs: Explicit Local Structure via Gaussian Blobs for Improved Cross-Domain LiDAR-based 3D Object Detection
PEER Pressure: Model-to-Model Regularization for Single Source Domain Generalization
GFlowVLM: Enhancing Multi-step Reasoning in Vision-Language Models with Generative Flow Networks
Gaze-LLE: Gaze Target Estimation via Large-Scale Learned Encoders
Learning to Highlight Audio by Watching Movies
O-TPT: Orthogonality Constraints for Calibrating Test-time Prompt Tuning in Vision-Language Models
Escaping Plato's Cave: Towards the Alignment of 3D and Text Latent Spaces
Recurrence-Enhanced Vision-and-Language Transformers for Robust Multimodal Document Retrieval
GeoMM: On Geodesic Perspective for Multi-modal Learning
LATTE-MV: Learning to Anticipate Table Tennis Hits from Monocular Videos
Factored-NeuS: Reconstructing Surfaces, Illumination, and Materials of Possibly Glossy Objects
Potential Field Based Deep Metric Learning
Improving Semi-Supervised Semantic Segmentation with Sliced-Wasserstein Feature Alignment and Uniformity
Exploiting Deblurring Networks for Radiance Fields
Notes-guided MLLM Reasoning: Enhancing MLLM with Knowledge and Visual Notes for Visual Question Answering
PLeaS - Merging Models with Permutations and Least Squares
DocLayLLM: An Efficient Multi-modal Extension of Large Language Models for Text-rich Document Understanding
FaithDiff: Unleashing Diffusion Priors for Faithful Image Super-resolution
HOTFormerLoc: Hierarchical Octree Transformer for Versatile Lidar Place Recognition Across Ground and Aerial Views
Efficient Video Super-Resolution for Real-time Rendering with Decoupled G-buffer Guidance
Efficient Visual State Space Model for Image Deblurring
CrossOver: 3D Scene Cross-Modal Alignment
Towards Optimizing Large-Scale Multi-Graph Matching in Bioimaging
Semantic and Expressive Variations in Image Captions Across Languages
Video-3D LLM: Learning Position-Aware Video Representation for 3D Scene Understanding
Video-Guided Foley Sound Generation with Multimodal Controls
ReVisionLLM: Recursive Vision-Language Model for Temporal Grounding in Hour-Long Videos
Context-Enhanced Memory-Refined Transformer for Online Action Detection
EntitySAM: Segment Everything in Video
Multi-modal Knowledge Distillation-based Human Trajectory Forecasting
Spatial Transport Optimization by Repositioning Attention Map for Training-Free Text-to-Image Synthesis
Video Motion Transfer with Diffusion Transformers
Omni-ID: Holistic Identity Representation Designed for Generative Tasks
L-SWAG: Layer-Sample Wise Activation with Gradients Information for Zero-Shot NAS on Vision Transformers
RELOCATE: A Simple Training-Free Baseline for Visual Query Localization Using Region-Based Representations
Fine-Grained Image-Text Correspondence with Cost Aggregation for Open-Vocabulary Part Segmentation
Mixture of Submodules for Domain Adaptive Person Search
ETAP: Event-based Tracking of Any Point
Immune: Improving Safety Against Jailbreaks in Multi-modal LLMs via Inference-Time Alignment
Joint Out-of-Distribution Filtering and Data Discovery Active Learning
DeepCompress-ViT: Rethinking Model Compression to Enhance Efficiency of Vision Transformers at the Edge
Variance-Based Membership Inference Attacks Against Large-Scale Image Captioning Models
Enhancing Privacy-Utility Trade-offs to Mitigate Memorization in Diffusion Models
Gen3DEval: Using vLLMs for Automatic Evaluation of Generated 3D Objects
MatAnyone: Stable Video Matting with Consistent Memory Propagation
Fractal Calibration for Long-tailed Object Detection
RASP: Revisiting 3D Anamorphic Art for Shadow-Guided Packing of Irregular Objects
Visual Representation Learning through Causal Intervention for Controllable Image Editing
Text-Driven Fashion Image Editing with Compositional Concept Learning and Counterfactual Abduction
HORP: Human-Object Relation Priors Guided HOI Detection
Perception Tokens Enhance Visual Reasoning in Multimodal Language Models
PICO: Reconstructing 3D People In Contact with Objects
InteractVLM: 3D Interaction Reasoning from 2D Foundational Models
Uni4D: Unifying Visual Foundation Models for 4D Modeling from a Single Video
Understanding Multi-Task Activities from Single-Task Videos
Golden Cudgel Network for Real-Time Semantic Segmentation
Viewpoint Rosetta Stone: Unlocking Unpaired Ego-Exo Videos for View-invariant Representation Learning
AdaptCMVC: Robust Adaption to Incremental Views in Continual Multi-view Clustering
Black Hole-Driven Identity Absorbing in Diffusion Models
Multi-modal Topology-embedded Graph Learning for Spatially Resolved Genes Prediction from Pathology Images with Prior Gene Similarity Information
Skip Tuning: Pre-trained Vision-Language Models are Effective and Efficient Adapters Themselves