CVPR
2025
Similarity-Guided Layer-Adaptive Vision Transformer for UAV Tracking
Mind the Gap: Detecting Black-box Adversarial Attacks in the Making through Query Update Analysis
Teaching Large Language Models to Regress Accurate Image Quality Scores using Score Distribution
Enhancing Online Continual Learning with Plug-and-Play State Space Model and Class-Conditional Mixture of Discretization
GauCho: Gaussian Distributions with Cholesky Decomposition for Oriented Object Detection
Self-Supervised Large Scale Point Cloud Completion for Archaeological Site Restoration
Diff-Palm: Realistic Palmprint Generation with Polynomial Creases and Intra-Class Variation Controllable Diffusion Models
SoundVista: Novel-View Ambient Sound Synthesis via Visual-Acoustic Binding
Every SAM Drop Counts: Embracing Semantic Priors for Multi-Modality Image Fusion and Beyond
DCEvo: Discriminative Cross-dimensional Evolutionary Learning for Infrared and Visible Image Fusion
GSTAR: Gaussian Surface Tracking and Reconstruction
Disentangled Pose and Appearance Guidance for Multi-Pose Generation
BioX-CPath: Biologically-driven Explainable Diagnostics for Multistain IHC Computational Pathology
Dora: Sampling and Benchmarking for 3D Shape Variational Auto-Encoders
Synthetic-to-Real Self-supervised Robust Depth Estimation via Learning with Motion and Structure Priors
EventGPT: Event Stream Understanding with Multimodal Large Language Models
SGSST: Scaling Gaussian Splatting Style Transfer
Probabilistic Gaussian Superposition for Efficient 3D Occupancy Prediction
DeepCompress-ViT: Rethinking Model Compression to Enhance Efficiency of Vision Transformers at the Edge
Schedule On the Fly: Diffusion Time Prediction for Faster and Better Image Generation
AVQACL: A Novel Benchmark for Audio-Visual Question Answering Continual Learning
EmoEdit: Evoking Emotions through Image Manipulation
ByTheWay: Boost Your Text-to-Video Generation Model to Higher Quality in a Training-free Way
Investigating the Role of Weight Decay in Enhancing Nonconvex SGD
Diffusion-based Realistic Listening Head Generation via Hybrid Motion Modeling
PillarHist: Height-aware Histogram for Quantization-friendly Pillar Feature Encoder
Font-Agent: Enhancing Font Understanding with Large Language Models
Volume Tells: Dual Cycle-Consistent Diffusion for 3D Fluorescence Microscopy De-noising and Super-Resolution
Silent Branding Attack: Trigger-free Data Poisoning Attack on Text-to-Image Diffusion Models
IDEA-Bench: How Far are Generative Models from Professional Designing?
Once-Tuning-Multiple-Variants: Tuning Once and Expanded as Multiple Vision-Language Model Variants
Scene Splatter: Momentum 3D Scene Generation from Single Image with Video Diffusion Model
Secret Lies in Color: Enhancing AI-Generated Images Detection with Color Distribution Analysis
Spiking Transformer: Introducing Accurate Addition-Only Spiking Self-Attention for Transformer
SVLTA: Benchmarking Vision-Language Temporal Alignment via Synthetic Video Situation
Crab: A Unified Audio-Visual Scene Understanding Model with Explicit Cooperation
A Hubness Perspective on Representation Learning for Graph-Based Multi-View Clustering
LT3SD: Latent Trees for 3D Scene Diffusion
Multi-modal Vision Pre-training for Medical Image Analysis
StickMotion: Generating 3D Human Motions by Drawing a Stickman
Rethinking Temporal Fusion with A Unified Gradient Descent View for 3D Semantic Occupancy Prediction
SceneFactor: Factored Latent 3D Diffusion for Controllable 3D Scene Generation
BACON: Improving Clarity of Image Captions via Bag-of-Concept Graphs
UMFN: Unified Multi-Domain Face Normalization for Joint Cross-domain Prototype Learning and Heterogeneous Face Recognition
OW-OVD: Unified Open World and Open Vocabulary Object Detection
Improving Sound Source Localization with Joint Slot Attention on Image and Audio
ImagineFSL: Self-Supervised Pretraining Matters on Imagined Base Set for VLM-based Few-shot Learning
ACE: Anti-Editing Concept Erasure in Text-to-Image Models
BiM-VFI: Bidirectional Motion Field-Guided Frame Interpolation for Video with Non-uniform Motions
Leveraging Temporal Cues for Semi-Supervised Multi-View 3D Object Detection
Pay Attention to the Foreground in Object-Centric Learning
PIAD: Pose and Illumination agnostic Anomaly Detection
ChatGen: A Unified Model for Interactive Multimodal Generation as We Chat
TopoCellGen: Generating Histopathology Cell Topology with a Diffusion Model
Generative Multiview Relighting for 3D Reconstruction under Extreme Illumination Variation
M$^3$-VOS: Multi-Phase, Multi-Transition, and Multi-Scenery Video Object Segmentation
Joint Vision-Language Social Bias Removal for CLIP
Is this Generated Person Existed in Real-world? Fine-grained Detecting and Calibrating Abnormal Human-body
SmartCLIP: Modular Vision-language Alignment with Identification Guarantees
Exploring CLIP's Dense Knowledge for Weakly Supervised Semantic Segmentation
TAGA: Self-supervised Learning for Template-free Animatable Gaussian Articulated Model
Towards Consistent Multi-Task Learning: Unlocking the Potential of Task-Specific Parameters
GIFStream: 4D Gaussian-based Immersive Video with Feature Stream
Adaptive Markup Language Generation for Contextually-Grounded Visual Document Understanding
Your Scale Factors are My Weapon: Targeted Bit-Flip Attacks on Vision Transformers via Scale Factor Manipulation
BlueLM-V-3B: Algorithm and System Co-Design for Multimodal Large Language Models on Mobile Devices
From Elements to Design: A Layered Approach for Automatic Graphic Design Composition
Spk2SRImgNet: Super-Resolve Dynamic Scene from Spike Stream via Motion Aligned Collaborative Filtering
Task-Aware Clustering for Prompting Vision-Language Models
Stereo4D: Learning How Things Move in 3D from Internet Stereo Videos
Imputation-free and Alignment-free: Incomplete Multi-view Clustering Driven by Consensus Semantic Learning
Estimating Body and Hand Motion in an Ego‑sensed World
Decentralized Diffusion Models
SapiensID: Foundation for Human Recognition
Towards Cost-Effective Learning: A Synergy of Semi-Supervised and Active Learning
OpenMIBOOD: Open Medical Imaging Benchmarks for Out-Of-Distribution Detection
Explainable Saliency: Articulating Reasoning with Contextual Prioritization
Sparse Image Sets Restoration with Multi-View Diffusion Model
SymDPO: Boosting In-Context Learning of Large Multimodal Models with Symbol Demonstration Direct Preference Optimization
Taming High-Resolution Text-to-Image Models for Mobile Devices with Efficient Architectures and Training
Weakly Supervised Temporal Action Localization via Dual-Prior Collaborative Learning Guided by Multimodal Large Language Models
Reconstructing Animals and the Wild
STPro: Spatial and Temporal Progressive Learning for Weakly Supervised Spatio-Temporal Grounding
VisionArena: 230k Real World Image Conversations with Paired Human Preferences
BEVDiffuser: Plug-and-Play Diffusion Model for BEV Denoising with Ground-Truth Guidance
VideoICL: Confidence-based Iterative In-context Learning for Out-of-Distribution Video Understanding
DSV-LFS: Unifying LLM-Driven Semantic Cues with Visual Features for Robust Few-Shot Segmentation
UniAP: Unifying Inter- and Intra-Layer Automatic Parallelism by Mixed Integer Quadratic Programming
PhaseScene: Dynamic Scene Generation with Phase-Specific Action Modeling for Embodied AI
MARVEL-40M+: Multi-Level Visual Elaboration for High-Fidelity Text-to-3D Content Creation
ZoomLDM: Latent Diffusion Model for multi-scale image generation
Vision-Language Gradient Descent-driven All-in-One Deep Unfolding Networks
SMILE: Infusing Spatial and Motion Semantics in Masked Video Learning
Motion Prompting: Controlling Video Generation with Motion Trajectories
Gaussian World Model for Streaming 3D Occupancy Prediction
MATCHA: Towards Matching Anything
4D LangSplat: 4D Language Gaussian Splatting via Multimodal Large Language Models
DynamicScaler: Seamless and Scalable Video Generation for Panoramic Scenes
A$^\text{T}$A: Adaptive Transformation Agent for Text-Guided Subject-Position Variable Background Inpainting
IDOL: Instant Photorealistic 3D Human Creation from a Single Image
Mani-GS: Gaussian Splatting Manipulation with Triangular Mesh
FluxSpace: Disentangled Image Editing in Rectified Flow Models
LP-Diff: Towards Improved Restoration of Real-World Degraded License Plate
Denoising Functional Maps: Diffusion Models for Shape Correspondence
UniSTD: Towards Unified Spatio-Temporal Prediction across Diverse Disciplines
Blurry-Edges: Photon-Limited Depth Estimation from Defocused Boundaries
DiTASK: Multi-Task Fine-Tuning with Diffeomorphic Transformations
ShowHowTo: Generating Scene-Conditioned Step-by-Step Visual Instructions
HD-EPIC: A Highly-Detailed Egocentric Video Dataset
MEET: Towards Memory-Efficient Temporal Delta-Sigma Deep Neural Networks
Online Task-Free Continual Learning via Dynamic Expansionable Memory Distribution
PRaDA: Projective Radial Distortion Averaging
PTDiffusion: Free Lunch for Generating Optical Illusion Hidden Pictures using Phase-Transferred Diffusion Model
Monocular Depth Priors for Robust Structure-from-Motion
Semantic Library Adaptation: LoRA Retrieval and Fusion for Open-Vocabulary Semantic Segmentation
MeshArt: Generating Articulated Meshes with Structure-guided Transformers
CALICO: Multi-Image Pixel-Grounded Object Comparison by Parts with Large Language Models
Locally Orderless Images for Optimization in Differentiable Rendering
A Unified Approach to Interpreting Self-supervised Pre-training Methods for 3D Point Clouds via Interactions
ReWind: Understanding Long Videos with Instructed Learnable Memory
MambaVLT: Time-Evolving Multimodal State Space Model for Vision-Language Tracking
Saliuitl: Ensemble Salience Guided Recovery of Adversarial Patches against CNNs
Ev-3DOD: Pushing the Temporal Boundaries of 3D Object Detection with Event Cameras
OmniSplat: Taming Feed-Forward 3D Gaussian Splatting for Omnidirectional Images with Editable Capabilities
Attend to Not Attended: Structure-then-Detail Token Merging for Post-training DiT Acceleration
GPAvatar: High-fidelity Head Avatars by Learning Efficient Gaussian Projections
It’s a (Blind) Match! Towards Vision-Language Correspondence without Parallel Data
Critic-V: VLM Critics Help Catch VLM Errors in Multimodal Reasoning
RSAR: Restricted State Angle Resolver and Rotated SAR Benchmark
Uncertainty-guided Perturbation for Image Super-Resolution Diffusion Model
GroundingFace: Fine-grained Face Understanding via Pixel Grounding Multimodal Large Language Model
MV-SSM: Multi-View State Space Modeling for 3D Human Pose Estimation
Retrieving Semantics from the Deep: an RAG Solution for Gesture Synthesis
MODA: Motion-Drift Augmentation for Inertial Human Motion Analysis
GaussHDR: High Dynamic Range Gaussian Splatting via Learning Unified 3D and 2D Local Tone Mapping
Disentangling Safe and Unsafe Image Corruptions via Anisotropy and Locality
Sim-to-Real Causal Transfer: A Metric Learning Approach to Causally-Aware Interaction Representations
HumanRig: Learning Automatic Rigging for Humanoid Character in a Large Scale Dataset
DrivingSphere: Building a High-fidelity 4D World for Closed-loop Simulation
Towards Lossless Implicit Neural Representation via Bit Plane Decomposition
KMD: Koopman Multi-modality Decomposition for Generalized Brain Tumor Segmentation under Incomplete Modalities
Golden Cudgel Network for Real-Time Semantic Segmentation
SALOVA: Segment-Augmented Long Video Assistance for Targeted Retrieval and Routing in Long-Form Video Analysis
HSI: A Holistic Style Injector for Arbitrary Style Transfer
From Alexnet to Transformers: Measuring the Non-linearity of Deep Neural Networks with Affine Optimal Transport
FloVD: Optical Flow Meets Video Diffusion Model for Enhanced Camera-Controlled Video Synthesis
Analyzing the Synthetic-to-Real Domain Gap in 3D Hand Pose Estimation
F$^3$OCUS - Federated Finetuning of Vision-Language Foundation Models with Optimal Client Layer Updating Strategy via Multi-objective Meta-Heuristics
Learning Endogenous Attention for Incremental Object Detection
HRAvatar: High-Quality and Relightable Gaussian Head Avatar
Multimodal Autoregressive Pre-training of Large Vision Encoders
GLUS: Global-Local Reasoning Unified into A Single Large Language Model for Video Segmentation
DiffusionDrive: Truncated Diffusion Model for End-to-End Autonomous Driving
PACT: Pruning and Clustering-Based Token Reduction for Faster Visual Language Models
UCOD-DPL: Unsupervised Camouflaged Object Detection via Dynamic Pseudo-label Learning
HumanMM: Global Human Motion Recovery from Multi-shot Videos
MaskGWM: A Generalizable Driving World Model with Video Mask Reconstruction
Parallelized Autoregressive Visual Generation
MEGA: Masked Generative Autoencoder for Human Mesh Recovery
Self-Supervised Cross-View Correspondence with Predictive Cycle Consistency
InstanceGaussian: Appearance-Semantic Joint Gaussian Representation for 3D Instance-Level Perception
DV-Matcher: Deformation-based Non-rigid Point Cloud Matching Guided by Pre-trained Visual Features
Collaborative Tree Search for Enhancing Embodied Multi-Agent Collaboration
Repurposing Stable Diffusion Attention for Training-Free Unsupervised Interactive Segmentation
ClimbingCap: Multi-Modal Dataset and Method for Rock Climbing in World Coordinate
EmotiveTalk: Expressive Talking Head Generation through Audio Information Decoupling and Emotional Video Diffusion
EchoMatch: Partial-to-Partial Shape Matching via Correspondence Reflection
CrossSDF: 3D Reconstruction of Thin Structures From Cross-Sections
ROCKET-1: Mastering Open-World Interaction with Visual-Temporal Context Prompting
Q-PART: Quasi-Periodic Adaptive Regression with Test-time Training for Pediatric Left Ventricular Ejection Fraction Regression
RC-AutoCalib: An End-to-End Radar-Camera Automatic Calibration Network
Segment This Thing: Foveated Tokenization for Efficient Point-Prompted Segmentation
Towards General Visual-Linguistic Face Forgery Detection
Holmes-VAU: Towards Long-term Video Anomaly Understanding at Any Granularity
VSNet: Focusing on the Linguistic Characteristics of Sign Language
EdgeDiff: Edge-aware Diffusion Network for Building Reconstruction from Point Clouds
Visual and Semantic Prompt Collaboration for Generalized Zero-Shot Learning
Style-Editor: Text-driven object-centric style editing
Flow-NeRF: Joint Learning of Geometry, Poses, and Dense Flow within Unified Neural Representations
SFDM: Robust Decomposition of Geometry and Reflectance for Realistic Face Rendering from Sparse-view Images
Lost in Translation, Found in Context: Sign Language Translation with Contextual Cues
HyperLoRA: Parameter-Efficient Adaptive Generation for Portrait Synthesis
Image Quality Assessment: From Human to Machine Preference
Hierarchical Gaussian Mixture Model Splatting for Efficient and Part Controllable 3D Generation
Decompositional Neural Scene Reconstruction with Generative Diffusion Prior
Feature Information Driven Position Gaussian Distribution Estimation for Tiny Object Detection
Rethinking Token Reduction with Parameter-Efficient Fine-Tuning in ViT for Pixel-Level Tasks
CompGS: Unleashing 2D Compositionality for Compositional Text-to-3D via Dynamically Optimizing 3D Gaussians
FactCheXcker: Mitigating Measurement Hallucinations in Chest X-ray Report Generation Models
Logits DeConfusion with CLIP for Few-Shot Learning
SemGeoMo: Dynamic Contextual Human Motion Generation with Semantic and Geometric Guidance
A Bias-Free Training Paradigm for More General AI-generated Image Detection
SALAD: Skeleton-aware Latent Diffusion for Text-driven Motion Generation and Editing
Bootstrap Your Own Views: Masked Ego-Exo Modeling for Fine-grained View-invariant Video Representations
MonoTAKD: Teaching Assistant Knowledge Distillation for Monocular 3D Object Detection
Ouroboros3D: Image-to-3D Generation via 3D-aware Recursive Diffusion
T2ISafety: Benchmark for Assessing Fairness, Toxicity, and Privacy in Image Generation
ReNeg: Learning Negative Embedding with Reward Guidance
LongVALE: Vision-Audio-Language-Event Benchmark Towards Time-Aware Omni-Modal Perception of Long Videos
HOTFormerLoc: Hierarchical Octree Transformer for Versatile Lidar Place Recognition Across Ground and Aerial Views
MIDI: Multi-Instance Diffusion for Single Image to 3D Scene Generation
Relation-Rich Visual Document Generator for Visual Information Extraction
AutoURDF: Unsupervised Robot Modeling from Point Cloud Frames Using Cluster Registration
Taming Video Diffusion Prior with Scene-Grounding Guidance for 3D Gaussian Splatting from Sparse Inputs
Style Quantization for Data-Efficient GAN Training
Knowledge-Aligned Counterfactual-Enhancement Diffusion Perception for Unsupervised Cross-Domain Visual Emotion Recognition
Opportunistic Single-Photon Time of Flight
LookCloser: Frequency-aware Radiance Field for Tiny-Detail Scene
EASEMVC: Efficient Dual Selection Mechanism for Deep Multi-View Clustering
Lifelong Knowledge Editing for Vision Language Models with Low-Rank Mixture-of-Experts
HoVLE: Unleashing the Power of Monolithic Vision-Language Models with Holistic Vision-Language Embedding
VideoSPatS: Video Spatiotemporal Splines for Disentangled Occlusion, Appearance and Motion Modeling and Editing
Multi-Resolution Pathology-Language Pre-training Model with Text-Guided Visual Representation
OmniStereo: Real-time Omnidirectional Depth Estimation with Multiview Fisheye Cameras
Cross-View Completion Models are Zero-shot Correspondence Estimators
MIMO: Controllable Character Video Synthesis with Spatial Decomposed Modeling
SOGS: Second-Order Anchor for Advanced 3D Gaussian Splatting
FineLIP: Extending CLIP’s Reach via Fine-Grained Alignment with Longer Text Inputs
FinePhys: Fine-grained Human Action Generation by Explicitly Incorporating Physical Laws for Effective Skeletal Guidance
MoFlow: One-Step Flow Matching for Human Trajectory Forecasting via Implicit Maximum Likelihood Estimation Distillation
Dissecting and Mitigating Diffusion Bias via Mechanistic Interpretability
FLARE: Feed-forward Geometry, Appearance and Camera Estimation from Uncalibrated Sparse Views
LiveComment: Learn Streaming Video LLM with Speech Transcription at Scale
Consistency-aware Self-Training for Iterative-based Stereo Matching
Domain Generalization in CLIP via Learning with Diverse Text Prompts
Textured Gaussians for Enhanced 3D Scene Appearance Modeling
Subspace Constraint and Contribution Estimation for Heterogeneous Federated Learning
FlexGS: Train Once, Deploy Everywhere with Many-in-One Flexible 3D Gaussian Splatting
Instance-wise Supervision-level Optimization in Active Learning
OODD: Test-time Out-of-Distribution Detection with Dynamic Dictionary
MaDCoW: Marginal Distortion Correction for Wide-Angle Photography with Arbitrary Objects
ArtiScene: Language-Driven Artistic 3D Scene Generation Through Image Intermediary
EAP-GS: Efficient Augmentation of Pointcloud for 3D Gaussian Splatting in Few-shot Scene Reconstruction
S$^3$GaitNet: Learning Local Features and Size Awareness from LiDAR Point Clouds for 3D Gait Recognition
ShotAdapter: Text-to-Multi-Shot Video Generation with Diffusion Models
OnlineAnySeg: Online Zero-Shot 3D Segmentation by Visual Foundation Model Guided 2D Mask Merging
DEAL: Data-Efficient Adversarial Learning for High-Quality Infrared Imaging
EBS-EKF: Accurate and High Frequency Event-based Star Tracking
Optical LEGO: An Optical Imaging Dataset and Benchmark at Deeply Subwavelength Resolution
Q-Bench-Video: Benchmark the Video Quality Understanding of LMMs
HUNet: Homotopy Unfolding Network for Image Compressive Sensing
BIP3D: Bridging 2D Images and 3D Perception for Embodied Intelligence
A Unified, Resilient, and Explainable Adversarial Patch Detector
Track Any Anomalous Object: A Granular Video Anomaly Detection Pipeline
METASCENES: Towards Automated Replica Creation for Real-world 3D Scans
SeaLion: Semantic Part-Aware Latent Point Diffusion Models for 3D Generation
RoboPEPP: Vision-Based Robot Pose and Joint Angle Estimation through Embedding Predictive Pre-Training
HarmonySet: A Comprehensive Dataset for Understanding Video-Music Semantic Alignment and Temporal Synchronization
Incomplete Multi-modal Brain Tumor Segmentation via Learnable Sorting State Space Model
Rashomon Sets for Prototypical-Part Models: Editing Accurate Interpretable Models in Real-Time
VidHalluc: Evaluating Temporal Hallucinations in Multimodal Large Language Models for Video Understanding
Simulator HC: Regression-based Online Simulation of Starting Problem-Solution Pairs for Homotopy Continuation in Geometric Vision
CheXWorld: Exploring Image World Modeling for Radiograph Representation Learning
VidComposition: Can MLLMs Analyze Compositions in Compiled Video?
Not Just Text: Uncovering Vision Modality Threats in Image Generation Models
CoMBO: Conflict Mitigation via Branched Optimization for Class Incremental Segmentation
MDP: Multidimensional Vision Model Pruning with Latency Constraint
Ego4o: Egocentric Human Motion Capture and Understanding from Multi-Modal Input
WISE: A Framework for Gigapixel Whole-Slide-Image Lossless Compression
Steering Away from Harm: An Adaptive Approach to Defending Vision Language Model Against Jailbreaks
Causal Composition Diffusion Model for Closed-loop Traffic Generation
Monocular and Generalizable Gaussian Talking Head Animation
Large-Scale Text-to-Image Model with Inpainting is a Zero-Shot Subject-Driven Image Generator
MangaNinja: Line Art Colorization with Precise Reference Following
Any6DPose: Model-free 6D Pose Estimation of Novel Objects
ComfyBench: Benchmarking LLM-based Agents in ComfyUI for Autonomously Designing Collaborative AI Systems
EDEN: Enhanced Diffusion for High-quality Large-motion Video Frame Interpolation
BWFormer: Building Wireframe Reconstruction from airborne LiDAR point clouds with Transformer
Multirate Neural Image Compression with Adaptive Lattice Vector Quantization
Keypoints Good for the Two-View Geometry Estimation Problem
CPath-Omni: A Unified Multimodal Foundation Model for Patch and Whole Slide Image Analysis in Computational Pathology
DynaMoDe-NeRF: Motion-aware Deblurring Neural Radiance Field for Dynamic Scenes
VideoGigaGAN: Towards Detail-rich Video Super-Resolution
Structured Artifact Removal with Scale-Adaptive Deformable Transformer
DynFocus: Dynamic Cooperative Network Empowers LLMs with Video Understanding
Are Spatial-Temporal Graph Convolution Networks for Human Action Recognition Over-Parameterized?
Non-Natural Image Understanding with Advancing Frequency-based Vision Encoders
ActiveGAMER: Active GAussian Mapping through Efficient Rendering
MultiGO: Towards Multi-level Geometry Learning for Monocular 3D Textured Human Reconstruction
GroomLight: Hybrid Inverse Rendering for Relightable Human Hair Appearance Modeling
VideoTree: Adaptive Tree-based Video Representation for LLM Reasoning on Long Videos
Open Ad-hoc Categorization with Contextualized Feature Learning
Generative Modeling of Class Probability for Multi Modal Representation Learning
Condensing Action Segmentation Datasets via Generative Network Inversion
LayoutVLM: Differentiable Optimization of 3D Layout via Vision-Language Models
Color Conditional Generation with Sliced Wasserstein Distance
VideoDPO: Omni-Preference Alignment for Video Diffusion Generation
PS-Diffusion: Photorealistic Subject-Driven Image Editing with Disentangled Control and Attention
SocialMOIF: Multi-Order Intention Fusion for Pedestrian Trajectory Prediction
Parallel Sequence Modeling via Generalization Spatial Propagation Network
MC$^2$: Multi-concept Guidance for Customized Multi-concept Generation
DeCLIP: Decoupled Learning for Open-Vocabulary Dense Perception
Seek Common Ground While Reserving Differences: Semi-supervised Image-Text Sentiment Recognition
Hazy Low-Quality Satellite Video Restoration Via Learning Optimal Joint Degradation Patterns and Continuous-Scale Super-Resolution Reconstruction
SLVR: Super-Light Visual Reconstruction via Blueprint Controllable Convolutions and Exploring Feature Diversity Representation
HEIE: MLLM-Based Hierarchical Explainable AIGC Image Implausibility Evaluator
High Dynamic Range Video Compression: A Large-Scale Benchmark Dataset and A Learned Bit-depth Scalable Compression Algorithm
MTADiffusion: Mask Text Alignment Diffusion Model for Object Inpainting
FSFM: A Generalizable Face Security Foundation Model via Self-Supervised Facial Representation Learning
PhysAnimator: Physics-Guided Generative Cartoon Animation
SAM-REF: Introducing Image-Prompt Synergy during Interaction for Detail Enhancement in the Segment Anything Model
Harnessing Frozen Unimodal Encoders for Flexible Multimodal Alignment
Open-World Objectness Modeling Unifies Novel Object Detection
NTClick: Achieving Precise Interactive Segmentation With Noise-tolerant Clicks
GlyphMastero: A Glyph Encoder for High-Fidelity Scene Text Editing
Relation3D: Enhancing Relation Modeling for Point Cloud Instance Segmentation
BizGen: Advancing Article-level Visual Text Rendering for Infographics Generation
SleeperMark: Towards Robust Watermark against Fine-Tuning Text-to-image Diffusion Models
Generalized Gaussian Entropy Model for Point Cloud Attribute Compression with Dynamic Likelihood Intervals
2DMamba: Efficient State Space Model for Image Representation with Applications on Giga-Pixel Whole Slide Image Classification
EVPGS: Enhanced View Prior Guidance for Splatting-based Extrapolated View Synthesis
TokenFlow: Unified Image Tokenizer for Multimodal Understanding and Generation
VL-RewardBench: A Challenging Benchmark for Vision-Language Generative Reward Models
DIFIX3D+: Improving 3D Reconstructions with Single-Step Diffusion Models
AMO Sampler: Enhancing Text Rendering with Overshooting
CoCoGaussian: Leveraging Circle of Confusion for Gaussian Splatting from Defocused Images
Wonderland: Navigating 3D Scenes from a Single Image
Goku: Generative Flow Kit for Unified Image-Video Creation
Enhanced Contrastive Learning with Multi-view Longitudinal Data for Chest X-ray Report Generation
Can Text-to-Video Generation help Video-Language Alignment?
Noise Diffusion for Enhancing Semantic Faithfulness in Text-to-Image Synthesis
SplineGS: Robust Motion-Adaptive Spline for Real-Time Dynamic 3D Gaussians from Monocular Video
EZSR: Event-based Zero-Shot Recognition
MergeVQ: A Unified Framework for Visual Generation and Representation with Disentangled Token Merging and Quantization
Beyond Clean Training Data: A Versatile and Model-Agnostic Framework for Out-of-Distribution Detection with Contaminated Training Data
Reconstructing People, Places, and Cameras
Forensics-Bench: A Comprehensive Forgery Detection Benchmark Suite for Large Vision Language Models
EchoWorld: Learning Motion-Aware World Models for Echocardiography Probe Guidance
Attention IoU: Examining Biases in CelebA using Attention Maps
DEFOM-Stereo: Depth Foundation Model Based Stereo Matching
A3: Few-shot Prompt Learning of Unlearnable Examples with Cross-Modal Adversarial Feature Alignment
Towards Explicit Geometry-Reflectance Collaboration for Generalized LiDAR Segmentation in Adverse Weather
Lifting the Veil on Visual Information Flow in MLLMs: Unlocking Pathways to Faster Inference
MotionDiT: Text-Based Human Motion Editing with Motion Similarity Prediction via Diffusion Transformers
STAR-Edge: Structure-aware Local Spherical Curve Representation for Thin-walled Edge Extraction from Unstructured Point Clouds
VideoAlchemy: Open-set Personalization in Video Generation
Gazing at Rewards: Eye Movements as a Lens into Human and AI Decision-Making in Hybrid Visual Foraging
Omni-ID: Holistic Identity Representation Designed for Generative Tasks
LLAVIDAL: A Large LAnguage VIsion Model for Daily Activities of Living
CryptoFace: End-to-End Encrypted Face Recognition
Generative Zero-Shot Composed Image Retrieval
AIM-Fair: Advancing Algorithmic Fairness via Selectively Fine-Tuning Biased Models with Contextual Synthetic Data
Cross-Modal Distillation for 2D/3D Multi-Object Discovery from 2D motion
Turbo3D: Ultra-fast Text-to-3D Generation
InterAct: Advancing Large-Scale Versatile 3D Human-Object Interaction Generation
T2V-CompBench: A Comprehensive Benchmark for Compositional Text-to-video Generation
FirePlace: Geometric Refinements of LLM Common Sense Reasoning for 3D Object Placement
Synthetic Prior for Few-Shot Drivable Head Avatar Inversion
Compositional Caching for Training-free Open-vocabulary Attribute Detection
Photorealistic Simulation-Ready Garments from a Single Pose
Reference-Based 3D-Aware Image Editing with Triplanes
AnyAttack: Targeted Adversarial Attacks on Vision-Language Models Toward Any Images
$\beta$-FFT: Nonlinear Interpolation and Differentiated Training Strategies for Semi-Supervised Medical Image Segmentation
Generating 3D-Consistent Videos from Unposed Internet Photos
Controllable Human Image Generation with Personalized Multi-Garments
MicroVQA: A Multimodal Reasoning Benchmark for Microscopy-Based Scientific Research
Feat2GS: Probing Visual Foundation Models with Gaussian Splatting
JiSAM: Alleviate Labeling Burden and Corner Case Problems in Autonomous Driving via Minimal Real-World Data
BIOMEDICA: An Open Biomedical Image-Caption Archive with Vision-Language Models derived from Scientific Literature
NVComposer: Boosting Generative Novel View Synthesis with Multiple Sparse and Unposed Images
3DEnhancer: Consistent Multi-View Diffusion for 3D Enhancement
Towards Smart Point-and-Shoot Photography
VisionZip: Longer is Better but Not Necessary in Vision Language Models
PanDA: Towards Panoramic Depth Anything with Unlabeled Panoramas and Möbius Spatial Augmentation
Automated Generation of Challenging Multiple-Choice Questions for Vision Language Model Evaluation
Event-Equalized Dense Video Captioning
BHViT: Binarized Hybrid Vision Transformer
S2Gaussian: Sparse-View Super-Resolution 3D Gaussian Splatting
Evaluating generated 3D assets using multiview Large Language Models
Learnable Infinite Taylor Gaussian for Dynamic View Rendering
FaceBench: A Multi-View Multi-Level Facial Attribute VQA Dataset for Benchmarking Face Perception MLLMs
Towards Optimizing Large-Scale Multi-Graph Matching in Bioimaging
Unconstrained 3D gaze estimation with Gaze-Aware 3D Context Encoding
MonSter: Marry Monodepth to Stereo Unleashes Power
EditSplat: Multi-View Fusion and Attention-Guided Optimization for View-Consistent 3D Scene Editing with 3D Gaussian Splatting
Shining Yourself: High-Fidelity Ornaments Virtual Try-on with Diffusion Model
Splatter-360: Generalizable 360$^{\circ}$ Gaussian Splatting for Wide-baseline Panoramic Images
Scene Map-based Prompt Tuning for Navigation Instruction Generation
Binarized Neural Network for Multi-spectral Image Fusion
Animate and Sound an Image
STINR: Deciphering Spatial Transcriptomics via Implicit Neural Representation
Object-Shot Enhanced Grounding Network for Egocentric Video
Using Powerful Prior Knowledge of Diffusion Model in Deep Unfolding Networks for Image Compressive Sensing
Insight-V: Exploring Long-Chain Visual Reasoning with Multimodal Large Language Models
Supervising Sound Localization using In-the-wild Egomotion
Fast and Accurate Gigapixel Pathological Image Classification with Hierarchical Distillation Multi-Instance Learning
MonoDGP: Monocular 3D Object Detection with Decoupled-Query and Geometry-Error Priors
JTD-UAV: MLLM-Enhanced Joint Tracking and Description Framework for Anti-UAV Systems
UNIC-Adapter: Unified Image-instruction Adapter with Multi-modal Transformer for Image Generation
Unboxed: Geometrically and Temporally Consistent Video Outpainting
DORNet: A Degradation Oriented and Regularized Network for Blind Depth Super-Resolution
Foundations of the Theory of Performance-Based Ranking
AVF-MAE++: Scaling Affective Video Facial Masked Autoencoders via Efficient Audio-Visual Self-Supervised Learning
Medusa: A Multi-Scale High-order Contrastive Dual-Diffusion Approach for Multi-View Clustering
When the Future Becomes the Past: Taming Temporal Correspondence for Self-supervised Video Representation Learning
CocoER: Aligning Multi-Level Feature by Competition and Coordination for Emotion Recognition
Revisiting Generative Replay for Class Incremental Object Detection
RefPose: Leveraging Reference Geometric Correspondences for Accurate 6D Pose Estimation of Unseen Objects
Visual-Instructed Degradation Diffusion for All-in-One Image Restoration
Point-Level Visual Affordance Guided Retrieval and Adaptation for Cluttered Garments Manipulation
DiscoVLA: Discrepancy Reduction in Vision, Language, and Alignment for Parameter-Efficient Video-Text Retrieval
ROLL: Robust Noisy Pseudo-label Learning for Multi-View Clustering with Noisy Correspondence
TriTex: Learning Texture from a Single Mesh via Triplane Semantic Features
Adaptive Parameter Selection for Tuning Vision-Language Models
Feature4X: Bridging Any Monocular Video to 4D Agentic AI with Versatile Gaussian Feature Fields
TKG-DM: Training-free Chroma Key Content Generation Diffusion Model
FilmComposer: LLM-Driven Music Production for Silent Film Clips
DreamCache: Finetuning-Free Lightweight Personalized Image Generation via Feature Caching
DeepLA-Net: Very Deep Local Aggregation Networks for Point Cloud Analysis
SharpDepth: Sharpening Metric Depth Predictions Using Diffusion Distillation
RoGSplat: Learning Robust Generalizable Human Gaussian Splatting from Sparse Multi-View Images
Person De-reidentification: A Variation-guided Identity Shift Modeling
Modeling Multiple Normal Action Representations for Error Detection in Procedural Tasks
3D-SLNR: A Super Lightweight Neural Representation for Large-scale 3D Mapping
Self-Supervised Learning for Color Spike Camera Reconstruction
LLMDet: Learning Strong Open-Vocabulary Object Detectors under the Supervision of Large Language Models
Don't Shake the Wheel: Momentum-Aware Planning in End-to-End Autonomous Driving
Argus: A Compact and Versatile Foundation Model for Vision
ImPortrait: Implicit Condition Control for Enhanced Portrait Animation
Rethinking the Adversarial Robustness of Multi-Exit Neural Networks in an Attack-Defense Game
LookingGlass: Generative Anamorphoses via Laplacian Pyramid Warping
FireEdit: Fine-grained Instruction-based Image Editing via Region-aware Vision Language Model
DynRefer: Delving into Region-level Multimodal Tasks via Dynamic Resolution
TAMT: Temporal-Aware Model Tuning for Cross-Domain Few-Shot Action Recognition
AToM: Aligning Text-to-Motion Model at Event-Level with GPT-4Vision Reward
ARKit LabelMaker: A New Scale for Indoor 3D Scene Understanding
BG-Triangle: Bézier Gaussian Triangle for 3D Vectorization and Rendering
BiomedCoOp: Learning to Prompt for Biomedical Vision-Language Models
FRESA: Feedforward Reconstruction of Personalized Skinned Avatars from Few Images
Towards All-in-One Medical Image Re-Identification
Spatiotemporal Skip Guidance for Enhanced Video Diffusion Sampling
Improving Adversarial Transferability on Vision Transformers via Forward Propagation Refinement
Unbiasing through Textual Descriptions: Mitigating Representation Bias in Video Benchmarks
CAV-MAE Sync: Improving Contrastive Audio-Visual Mask Autoencoders via Fine-Grained Alignment
APHQ-ViT: Post-Training Quantization with Average Perturbation Hessian Based Reconstruction for Vision Transformers
PICO: Reconstructing 3D People In Contact with Objects
GASP: Gaussian Avatars with Synthetic Priors
ForestLPR: LiDAR Place Recognition in Forests Attentioning Multiple BEV Density Images
Exploring Semantic Feature Discrimination for Perceptual Image Super-Resolution and Opinion-Unaware No-Reference Image Quality Assessment
Unity in Diversity: Video Editing via Gradient-Latent Purification
Data-Free Group-Wise Fully Quantized Winograd Convolution via Learnable Scales
SP3D: Boosting Sparsely-Supervised 3D Object Detection via Accurate Cross-Modal Semantic Prompts
Complexity Experts are Task-Discriminative Learners for Any Image Restoration
Robust 3D Shape Reconstruction in Zero-Shot from a Single Image in the Wild
Cross-Modal Interactive Perception Network with Mamba for Lung Tumor Segmentation in PET-CT Images
h-Edit: Effective and Flexible Diffusion-Based Editing via Doob's h-Transform
SeriesBench: A Benchmark for Narrative-Driven Drama Series Understanding
MET3R: Measuring Multi-View Consistency in Generated Images
Time of the Flight of the Gaussians: Fast and Accurate Dynamic Time-of-Flight Radiance Fields
ProjAttacker: A Configurable Physical Adversarial Attack for Face Recognition via Projector
Immune: Improving Safety Against Jailbreaks in Multi-modal LLMs via Inference-Time Alignment
GLane3D: Detecting Lanes with Graph of 3D Keypoints
Unlocking Tuning-Free Few-Shot Adaptability in Visual Foundation Models by Recycling Pre-Tuned LoRAs
Towards Improved Text-Aligned Codebook Learning: Multi-Hierarchical Codebook-Text Alignment with Long Text
DualPM: Dual Posed-Canonical Point Maps for 3D Shape and Pose Reconstruction
HierarQ: Task-Aware Hierarchical Q-Former for Enhanced Video Understanding
Nonisotropic Gaussian Diffusion for Realistic 3D Human Motion Prediction
Type-R: Automatically Retouching Typos for Text-to-Image Generation
Black-Box Forgery Attacks on Semantic Watermarks for Diffusion Models
MoEdit: On Learning Quantity Perception for Multi-object Image Editing
Pippo: High-Resolution Multi-View Humans from a Single Image
UNEM: UNrolled Generalized EM for Transductive Few-Shot Learning
COB-GS: Clear Object Boundaries in 3DGS Segmentation Based on Boundary-Adaptive Gaussian Splitting
Memories of Forgotten Concepts
Point Clouds Meets Physics: Dynamic Acoustic Field Fitting Network for Point Cloud Understanding
ATP: Adaptive Threshold Pruning for Efficient Data Encoding in Quantum Neural Networks
A Closer Look at Time Steps is Worthy of Triple Speed-Up for Diffusion Model Training
Spatial-Temporal Graph Diffusion Policy with Kinematics Modeling for Bimanual Robotic Manipulation
Dense-To-Sparse Video Diffusion For High-fidelity Multi-View Images Synthesis
Emphasizing Discriminative Features for Dataset Distillation in Complex Scenarios
ITA-MDT: Image-Timestep-Adaptive Masked Diffusion Transformer Framework for Image-Based Virtual Try-On
Less is More: Efficient Model Merging with Binary Task Switch
Black Hole-Driven Identity Absorbing in Diffusion Models
Coarse Correspondences Boost Spatial-Temporal Reasoning in Multimodal Language Model
Revisiting Source-Free Domain Adaptation: Insights into Representativeness, Generalization, and Diversity
From Sparse to Dense: Camera Relocalization with Scene-Specific Detector from Feature Gaussian Splatting
Rethinking Training for De-biasing Text-to-Image Generation: Unlocking the Potential of Stable Diffusion
Neuro-Symbolic Evaluation of Text-to-Video Models using Formal Verification
Population Normalization for Federated Learning
PatchDPO: Patch-level DPO for Finetuning-free Personalized Image Generation
StageDesigner: Artistic Stage Generation for Scenography via Theater Scripts
Bridging Viewpoint Gaps: Geometric Reasoning Boosts Semantic Correspondence
Frequency-Biased Synergistic Design for Image Compression and Compensation
BF-STVSR: B-Splines and Fourier---Best Friends for High Fidelity Spatial-Temporal Video Super-Resolution
Audio-Visual Semantic Graph Network for Audio-Visual Event Localization
Precise Event Spotting in Sports Videos: Solving Long-Range Dependency and Class Imbalance
Img-Diff: Contrastive Data Synthesis for Multimodal Large Language Models
NoiseCtrl: A Sampling-Algorithm-Agnostic Conditional Generation Method for Diffusion Models
Rotation-Equivariant Self-Supervised Method in Image Denoising
CLIP Under the Microscope: A Fine-Grained Analysis of Multi-Object Representation
Search and Detect: Training-Free Long Tail Object Detection via Web-Image Retrieval
Dual Diffusion for Unified Image Generation and Understanding
Dynamic Pseudo Labeling via Gradient Cutting for High-Low Entropy Exploration
SEAL: Semantic Attention Learning for Long Video Representation
SyncVP: Joint Diffusion for Synchronous Multi-Modal Video Prediction
Extrapolating and Decoupling Image-to-Video Generation Models: Motion Modeling is Easier Than You Think
Exploring Sparse MoE in GANs for Text-conditioned Image Synthesis
SemAlign3D: Semantic Correspondence between RGB-Images through Aligning 3D Object-Class Representations
CLIP-driven Coarse-to-fine Semantic Guidance for Fine-grained Open-set Semi-supervised Learning
Where's the liability in the Generative Era? Recovery-based Black-Box Detection of AI-Generated Content
Pose-Guided Temporal Enhancement for Robust Low-Resolution Hand Reconstruction
Generating 6DoF Object Manipulation Trajectories from Action Description in Egocentric Vision
Learning Heterogeneous Tissues with Mixture of Experts for Gigapixel Whole Slide Images
Fuzzy Multimodal Learning for Trusted Cross-modal Retrieval
Learning Physics From Video: Unsupervised Physical Parameter Estimation for Continuous Dynamical Systems
TADFormer: Task-Adaptive Dynamic TransFormer for Efficient Multi-Task Learning
Go-with-the-Flow: Motion-Controllable Video Diffusion Models Using Real-Time Warped Noise Flow
Antidote: A Unified Framework for Mitigating LVLM Hallucinations in Counterfactual Presupposition and Object Perception
Learning Conditional Space-Time Prompt Distributions for Video Class-Incremental Learning
ACAttack: Adaptive Cross Attacking RGB-T Tracker via Multi-Modal Response Decoupling
Seeking Consistent Flat Minima for Better Domain Generalization via Refining Loss Landscapes
Heterogeneous Skeleton-Based Action Representation Learning
ReSpec: Relevance and Specificity Grounded Online Filtering for Learning on Video-Text Data Streams
Lessons Learned from a Unifying Empirical Study of Parameter-Efficient Fine-Tuning (PEFT) in Visual Recognition
FedCALM: Conflict-aware Layer-wise Mitigation for Selective Aggregation in Deeper Personalized Federated Learning
HybridMQA: Exploring Geometry-Texture Interactions for Colored Mesh Quality Assessment
Dual Energy-Based Model with Open-World Uncertainty Estimation for Out-of-distribution Detection
DivPrune: Diversity-based Visual Token Pruning for Large Multimodal Models
Which Viewpoint Shows it Best? Language for Weakly Supervising View Selection in Multi-view Instructional Videos
CTRL-O: Language-Controllable Object-Centric Visual Representation Learning
Mixture of Submodule for Domain Adaptive Person Search
Test-Time Fine-Tuning of Image Compression Models for Multi-Task Adaptability
HuPerFlow: A Comprehensive Benchmark for Human vs. Machine Motion Estimation
Preserving Clusters in Prompt Learning for Unsupervised Domain Adaptation
Interactive Medical Image Analysis with Concept-based Similarity Reasoning
AutoPresent: Designing Structured Visuals From Scratch
Joint Scheduling of Causal Prompts and Tasks for Multi-Task Learning
VL2Lite: Task-Specific Knowledge Distillation from Large Vision-Language Models to Lightweight Networks
BOE-ViT: Boosting Orientation Estimation with Equivariance in Self-Supervised 3D Subtomogram Alignment
SceneDiffuser++: City-Scale Traffic Simulation via a Generative World Model
Automated Joint Structured Pruning and Quantization for Efficient Neural Network Training and Compression
CRISP: Object Pose and Shape Estimation with Test-Time Adaptation
UniPhy: Learning a Unified Constitutive Model for Inverse Physics Simulation
Gradient Inversion Attacks on Parameter-Efficient Fine-Tuning
Multi-Group Proportional Representations for Text-to-Image Models
Accurate Scene Text Recognition with Efficient Model Scaling and Cloze Self-Distillation
RDD: Robust Feature Detector and Descriptor using Deformable Transformer
Escaping Plato's Cave: Towards the Alignment of 3D and Text Latent Spaces
Learning Visual Composition through Improved Semantic Guidance
SPARC: Score Prompting and Adaptive Fusion for Zero-Shot Multi-Label Recognition in Vision-Language Models
TANGO: Training-free Embodied AI Agents for Open-world Tasks
Multi-View Pose-Agnostic Change Localization with Zero Labels
On-Device Self-Supervised Learning of Low-Latency Monocular Depth from Only Events
Improving the Training of Data Efficient GANs via Quality Aware Dynamic Discriminator Rejection Sampling
Rethinking End-to-End 2D to 3D Scene Segmentation in Gaussian Splatting
STOP: Integrated Spatial-Temporal Dynamic Prompting for Video Understanding
Speedy-Splat: Fast 3D Gaussian Splatting with Sparse Pixels and Sparse Primitives
Can Large Vision-Language Models Correct Grounding Errors By Themselves?
SketchVideo: Sketch-based Video Generation and Editing
Generative Photography: Scene-Consistent Camera Control for Realistic Text-to-Image Synthesis
Learning Phase Distortion with Selective State Space Models for Video Turbulence Mitigation
Improving Gaussian Splatting with Localized Points Management
WiLoR: End-to-end 3D hand localization and reconstruction in-the-wild
Customized Condition Controllable Generation for Video Soundtrack
UA-Pose: Uncertainty-Aware 6D Object Pose Estimation and Online Object Completion with Partial References
Order-Robust Class Incremental Learning: Graph-Driven Dynamic Similarity Grouping
QuartDepth: Post-Training Quantization for Real-Time Depth Estimation on the Edge
The PanAf-FGBG Dataset: Understanding the Impact of Backgrounds in Wildlife Behaviour Recognition
AC3D: Analyzing and Improving 3D Camera Control in Video Diffusion Transformers
A Kernel Perspective on Few-Shot Adaptation of Large Vision-Language Models
EnvPoser: Environment-aware Realistic Human Motion Estimation from Sparse Observations with Uncertainty Modeling
DNF: Unconditional 4D Generation with Dictionary-based Neural Fields
RePerformer: Immersive Human-centric Volumetric Videos from Playback to Photoreal Reperformance
HyperPose: Hypernetwork-Infused Camera Pose Localization and an Extended Cambridge Landmarks Dataset
Reanimating Images using Neural Representations of Dynamic Stimuli
Flash3D: Super-scaling Point Transformers through Joint Hardware-Geometry Locality
MUST: The First Dataset and Unified Framework for Multispectral UAV Single Object Tracking
MIMO: A medical vision language model with visual referring multimodal input and pixel grounding multimodal output
FlexUOD: The Answer to Real-world Unsupervised Image Outlier Detection
FedBiP: Heterogeneous One-Shot Federated Learning with Personalized Latent Diffusion Models
AdMiT: Adaptive Multi-Source Tuning in Dynamic Environments
Instruction-based Image Manipulation by Watching How Things Move
LoKi: Low-dimensional KAN for Efficient Fine-tuning Image Models
FiRe: Fixed-points of Restoration Priors for Solving Inverse Problems
Temporally Consistent Object-Centric Learning by Contrasting Slots
A new statistical model of star speckles for learning to detect and characterize exoplanets in direct imaging observations
DiffusionSfM: Predicting Structure and Motion via Ray Origin and Endpoint Diffusion
LoTUS: Large-Scale Machine Unlearning with a Taste of Uncertainty
MOVIS: Enhancing Multi-Object Novel View Synthesis for Indoor Scenes
Exploring Scene Affinity for Semi-Supervised LiDAR Semantic Segmentation
SlideChat: A Large Vision-Language Assistant for Whole-Slide Pathology Image Understanding
Exploring Simple Open-Vocabulary Semantic Segmentation
OD3R: Online Dense 3D Reconstruction of Humans and Scenes from Monocular Videos
Weakly Supervised Contrastive Adversarial Training for Learning Robust Features from Semi-supervised Data
Layer- and Timestep-Adaptive Differentiable Token Compression Ratios for Efficient Diffusion Transformers
Electromyography-Informed Facial Expression Reconstruction For Physiological-Based Synthesis and Analysis
HOP: Heterogeneous Topology-based Multimodal Entanglement for Co-Speech Gesture Generation
Graph Neural Network Combining Event Stream and Periodic Aggregation for Low-Latency Event-based Vision
Hyperbolic Safety-Aware Vision-Language Models
Masked Point-Entity Contrast for Open-Vocabulary 3D Scene Understanding
LC-Mamba: Local and Continuous Mamba with Shifted Windows for Frame Interpolation
Domain Adaptive Diabetic Retinopathy Grading with Model Absence and Flowing Data
Unveiling the Mist over 3D Vision-Language Understanding: Object-centric Evaluation with Chain-of-Analysis
T-FAKE: Synthesizing Thermal Images for Facial Landmarking
FedMIA: An Effective Membership Inference Attack Exploiting "All for One" Principle in Federated Learning
Explaining Domain Shifts in Language: Concept Erasing for Interpretable Image Classification
Circumventing shortcuts in audio-visual deepfake detection datasets with unsupervised learning
GroupMamba: Efficient Group-Based Visual State Space Model
Boost the Inference with Co-training: A Depth-guided Mutual Learning Framework for Semi-supervised Medical Polyp Segmentation
All Languages Matter: Evaluating LMMs on Culturally Diverse 100 Languages
SciBench: Addressing Scientific Illusions in Image Synthesis
GRAPHGPT-O: Synergistic Multimodal Comprehension and Generation on Graphs
RoadSocial: A Diverse Dataset and Benchmark for Road Event Understanding from Social Video Narratives
Do ImageNet-trained models learn shortcuts? The impact of frequency shortcuts on generalization
Generalized Few-shot 3D Point Cloud Segmentation with Vision-Language Model
ANNEXE: Unified Analyzing, Answering, and Pixel Grounding for Egocentric Interaction
Decouple Distortion from Perception: Region Adaptive Diffusion for Extreme-low Bitrate Perception Image Compression
Unlearning through Knowledge Overwriting: Reversible Federated Unlearning via Selective Sparse Adapter
LSNet: See Large, Focus Small
Good, Cheap, and Fast: Overfitted Image Compression with Wasserstein Distortion
SF2T: Self-supervised Fragment Finetuning of Video-LLMs for Fine-Grained Understanding
V$^2$Dial: Unification of Video and Visual Dialog via Multimodal Experts
HyperGS: Hyperspectral 3D Gaussian Splatting
Simpler Diffusion: 1.5 FID on ImageNet512 with pixel-space diffusion
DTGBrepGen: A Novel B-rep Generative Model through Decoupling Topology and Geometry
NitroFusion: High-Fidelity Single-Step Diffusion through Dynamic Adversarial Training
Edge-SD-SR: Low Latency and Parameter Efficient On-device Super-Resolution with Stable Diffusion via Bidirectional Conditioning
Evaluating Model Perception of Color Illusions in Photorealistic Scenes
SynerGen-VL: Towards Synergistic Image Understanding and Generation with Vision Experts and Token Folding
Seeing Speech and Sound: Distinguishing and Locating Audio Sources in Visual Scenes
MuTri: Multi-view Tri-alignment for OCT to OCTA 3D Image Translation
Towards Fine-Grained Interpretability: Counterfactual Explanations for Misclassification with Saliency Partition
Enhancing Dataset Distillation via Non-Critical Region Refinement
VidTwin: Video VAE with Decoupled Structure and Dynamics
FineVQ: Fine-Grained User Generated Content Video Quality Assessment
Mesh Mamba: A Unified State Space Model for Saliency Prediction in Non-Textured and Textured Meshes
FAM Diffusion: Frequency and Attention Modulation for High-Resolution Image Generation with Stable Diffusion
Discriminative Fine-tuning of LVLMs
FSBench: A Figure Skating Benchmark for Advancing Artistic Sports Understanding
FLAVC: Learned Video Compression with Feature Level Attention
MP-GUI: Modality Perception with MLLMs for GUI Understanding
Adapting Text-to-Image Generation with Feature Difference Instruction for Generic Image Restoration
Pathways on the Image Manifold: Image Editing via Video Generation
KeyFace: Expressive Audio-Driven Facial Animation for Long Sequences via KeyFrame Interpolation
EchoMimicV2: Towards Striking, Simplified, and Semi-Body Human Animation
Are Images Indistinguishable to Humans Also Indistinguishable to Classifiers?
Masked Scene Modeling: Supervised-Level Performance with Self-Supervised Learning in 3D Scene Understanding
Rethinking Decoder Design: Improving Biomarker Segmentation Using Depth-to-Space Restoration and Residual Linear Attention
Comprehensive Information Bottleneck for Unveiling Universal Attribution to Interpret Vision Transformers
PBR-NeRF: Inverse Rendering with Physics-Based Neural Fields
Zero-Shot Image Restoration Using Few-Step Guidance of Consistency Models (and Beyond)
One2Any: One-Reference 6D Pose Estimation for Any Object
TAPT: Test-Time Adversarial Prompt Tuning for Robust Inference in Vision-Language Models
Adapting to Observation Length of Trajectory Prediction via Contrastive Learning
MarkushGrapher: Joint Visual and Textual Recognition of Markush Structures
Exploration-Driven Generative Interactive Environments
ShapeShifter: 3D Variations Using Multiscale and Sparse Point-Voxel Diffusion
DefectFill: Realistic Defect Generation with Inpainting Diffusion Model for Visual Inspection
GliaNet: Adaptive Neural Network Structure Learning with Glia-Driven
T-CIL: Temperature Scaling using Adversarial Perturbation for Calibration in Class-Incremental Learning
Recognition-Synergistic Scene Text Editing
SpectroMotion: Dynamic 3D Reconstruction of Specular Scenes
UniK3D: Universal Camera Monocular 3D Estimation
PVC: Progressive Visual Token Compression for Unified Image and Video Processing in Large Vision-Language Models
GCC: Generative Color Constancy via Diffusing a Color Checker
Text Augmented Correlation Transformer For Few-shot Classification & Segmentation
DiverseFlow: Sample-Efficient Diverse Mode Coverage in Flows
ODE: Open-Set Evaluation of Hallucinations in Multimodal Large Language Models
Instant Adversarial Purification with Adversarial Consistency Distillation
TSAM: Temporal SAM Augmented with Multimodal Prompts for Referring Audio-Visual Segmentation
Self-Cross Diffusion Guidance for Text-to-Image Synthesis of Similar Subjects
RENO: Real-Time Neural Compression for 3D LiDAR Point Clouds
Toward Robust Neural Reconstruction from Sparse Point Sets
Mono3DVLT: Monocular-Video-Based 3D Visual Language Tracking
Difference Inversion: Interpolate and Isolate the Difference with Token Consistency for Image Analogy Generation
TokenMotion: Decoupled Motion Control via Token Disentanglement for Human-centric Video Generation
A Compound 3D-Informed Design toward Spatially-Intelligent Large Multimodal Models
Tightening Robustness Verification of MaxPool-based Neural Networks via Minimizing the Over-Approximation Zone
H-MoRe: Learning Human-centric Motion Representation for Action Analysis
MASH-VLM: Mitigating Action-Scene Hallucination in Video-LLMs through Disentangled Spatial-Temporal Representations
Arc2Avatar: Generating Expressive 3D Avatars from a single image via ID Guidance
Image-Referenced Sketch Colorization Based on Animation Creation Workflow
Infinity∞: Scaling Bitwise AutoRegressive Modeling for High-Resolution Image Synthesis
Chat2SVG: Vector Graphics Generation with Large Language Models and Image Diffusion Models
Through-The-Mask: Mask-based Motion Trajectories for Image-to-Video Generation
A Regularization-Guided Equivariant Approach for Image Restoration
Hybrid Global-Local Representation with Augmented Spatial Guidance for Zero-Shot Referring Image Segmentation
Anomaly Anything: Promptable Unseen Visual Anomaly Generation
Period-LLM: Extending the Periodic Capability of Multimodal Large Language Model
MixerMDM: Learnable Mixing of Human Motion Diffusion Models
Multi-Modal Aerial-Ground Cross-View Place Recognition with Neural ODEs
Improving Accuracy and Calibration via Differentiated Deep Mutual Learning
ComRoPE: Scalable and Robust Rotary Position Embedding Parameterized by Trainable Commuting Angle Matrices
Hierarchy-Aware Evaluation of Free-Form Predictions From Vision-And-Language Models
LPOSS: Label Propagation Over Patches and Pixels for Open-vocabulary Semantic Segmentation
Multi-focal Conditioned Latent Diffusion for Person Image Synthesis
Balancing Two Classifiers via A Simplex ETF Structure for Model Calibration
Stop learning it all to mitigate visual hallucination, Focus on the hallucination target.
Directional Label Diffusion Model for Learning from Noisy Labels
Watermarking One for All: A Robust Watermarking Scheme Against Partial Image Theft
ILIAS: Instance-Level Image retrieval At Scale
UltraFusion: Ultra High Dynamic Imaging using Exposure Fusion
ViUniT: Visual Unit Tests for More Robust Visual Programming
Semantic-guided Cross-Model Prompt Learning for skeleton-based zero-shot action recognition
Reason-before-Retrieve: One-Stage Reflective Chain-of-Thoughts for Training-Free Zero-Shot Composed Image Retrieval
RSVOS-SAM: High-Quality Interactive Segmentation for Remote Sensing Video Object
Missing Target-Relevant Information Prediction with World Model for Accurate Zero-Shot Composed Image Retrieval
StarGen: A Spatiotemporal Autoregression Framework with Video Diffusion Model for Scalable and Controllable Scene Generation
Six-CD: Benchmarking Concept Removals for Benign Text-to-image Diffusion Models
Multi-Modal Contrastive Masked Autoencoders: A Two-Stage Progressive Pre-training Approach for RGBD Datasets
Removing Reflections from RAW Photos
Convex Combination Star Shape Prior for Data-driven Image Semantic Segmentation
Matrix3D: Large Photogrammetry Model All-in-One
Collaborative Decoding Makes Visual Autoregressive Modeling Efficient
EffiDec3D: An Optimized Decoder for High-Performance and Efficient 3D Medical Image Segmentation
Low-Rank Adaptation with Token Selection for Point Cloud Learning
TreeMeshGPT: Artistic Mesh Generation with Autoregressive Tree Sequencing
GuardSplat: Efficient and Robust Watermarking for 3D Gaussian Splatting
Text Embedding is Not All You Need: Attention Control for Text-to-Image Semantic Alignment with Text Self-Attention Maps
Decoupling Training-Free Guided Diffusion by ADMM
Optimal Transport-Guided Source-Free Adaptation for Face Anti-Spoofing
Multi-layer Visual Feature Fusion in Multimodal LLMs: Methods, Analysis, and Best Practices
RandAR: Decoder-only Autoregressive Visual Generation in Random Orders
Model Diagnosis and Correction via Linguistic and Implicit Attribute Editing
MoSca: Dynamic Gaussian Fusion from Casual Videos via 4D Motion Scaffolds
Generalizing Deepfake Video Detection with Plug-and-Play: Video-Level Blending and Spatiotemporal Adapter Tuning
Traversing Distortion-Perception Tradeoff using a Single Score-Based Generative Model
ImViD: Immersive Volumetric Videos for Enhanced VR Engagement
SyncSDE: A Probabilistic Framework for Why Diffusion Synchronization Works
Decouple-Then-Merge: Finetune Diffusion Models as Multi-Task Learning
CoMapGS: Covisibility Map-based Gaussian Splatting for Sparse Novel View Synthesis
Brain-Inspired Spiking Neural Networks for Energy-Efficient Object Detection
Fish-Vista: A Multi-Purpose Dataset for Understanding & Identification of Traits from Images
ROD-MLLM: Towards More Reliable Object Detection in Multimodal Large Language Models
Poly-Autoregressive Prediction for Modeling Interactions
Multi-Granularity Class Prototype Topology Distillation for Class-Incremental Source-Free Unsupervised Domain Adaptation
Chebyshev Attention Depth Permutation Texture Network with Latent Texture Attribute Loss
Unveiling the Ignorance of MLLMs: Seeing Clearly, Answering Incorrectly
Global-Local Tree Search in VLMs for 3D Indoor Scene Generation
Focus-N-Fix: Region-Aware Fine-Tuning for Text-to-Image Generation
Perturb-and-Revise: Flexible 3D Editing with Generative Trajectories
SoftVQ-VAE: Efficient 1-Dimensional Continuous Tokenizer
STDD: Spatio-Temporal Dual Diffusion for Video Generation
Exploring Intrinsic Normal Prototypes within a Single Image for Universal Anomaly Detection
PhyT2V: LLM-Guided Iterative Self-Refinement for Physics-Grounded Text-to-Video Generation
FRAMES-VQA: Benchmarking Fine-Tuning Robustness across Multi-Modal Shifts in Visual Question Answering
DarkIR: Robust Low-Light Image Restoration
Learning Flow Fields in Attention for Controllable Person Image Generation
Leveraging 3D Geometric Priors in 2D Rotation Symmetry Detection
Scenario Dreamer: Vectorized Generative Simulation Environments for Autonomous Driving
VideoHandles: Editing 3D Object Compositions in Videos Using Video Generative Priors
Open Set Label Shift with Test Time Out-of-Distribution Reference
Conformal Prediction and MLLM aided Uncertainty Quantification in Scene Graph Generation
Pose Priors from Language Models
Probability Density Geodesics in Image Diffusion Latent Space
STEREO: A Two-Stage Framework for Adversarially Robust Concept Erasing from Text-to-Image Diffusion Models
Camera resection from known line pencils and a radially distorted scanline
Improved monocular depth prediction using distance transform over pre-semantic contours with self-supervised neural networks
Revisiting Backdoor Attacks against Large Vision-Language Models from Domain Shift
SAMBLE: Shape-Specific Point Cloud Sampling for an Optimal Trade-Off Between Local Detail and Global Uniformity
ESCAPE: Equivariant Shape Completion via Anchor Point Encoding
SphereUFormer: A U-Shaped Transformer for Spherical 360 Perception
Perceptual Inductive Bias Is What You Need Before Contrastive Learning
Closed-Loop Supervised Fine-Tuning of Tokenized Traffic Models
LatentHOI: On the Generalizable Hand Object Motion Generation with Latent Hand Diffusion
Exploring Visual Vulnerabilities via Multi-Loss Adversarial Search for Jailbreaking Vision-Language Models
Learning Occlusion-Robust Vision Transformers for Real-Time UAV Tracking
Cross-Modal Space-Time Correspondence as a Contrastive Random Walk
DiffDNO: Diffusion Fourier Neural Operator
AlignMamba: Enhancing Multimodal Mamba with Local and Global Cross-modal Alignment
DnLUT: Ultra-Efficient Color Image Denoising via Channel-Aware Lookup Tables
Flash-Split: 2D Reflection Removal with Flash Cues and Latent Separation
Language Guided Concept Bottleneck Models for Interpretable Continual Learning
OmniManip: Towards General Robotic Manipulation via Object-Centric Interaction Primitives as Spatial Constraints
Diff2Flow: Bridging the Gap between Diffusion and Flow Matching with Minimal Cost
vesselFM: A Foundation Model for Universal 3D Blood Vessel Segmentation
One Diffusion to Generate Them All
COBRA: COmBinatorial Retrieval Augmentation for Few-Shot Learning
FruitNinja: 3D Object Interior Texture Generation with Gaussian Splatting
AerialMegaDepth: Learning Aerial-Ground Reconstruction and View Synthesis
Few-shot Personalized Scanpath Prediction
Zero-Shot Head Swapping in Real-World Scenarios
PreciseCam: Precise Camera Control for Text-to-Image Generation
SVFR: A Unified Framework for Generalized Video Face Restoration
FastVLM: Efficient Vision Encoding for Vision Language Models
ASIGN: An Anatomy-aware Spatial Imputation Graphic Network for 3D Spatial Transcriptomics
Adaptive Rectangular Convolution for Remote Sensing Pansharpening
A Semantic Knowledge Complementarity based Decoupling Framework for Semi-supervised Class-imbalanced Medical Image Segmentation
Hardware-Rasterized Ray-Based Gaussian Splatting
StoryGPT-V: Large Language Models as Consistent Story Visualizers
Leveraging Global Stereo Consistency for Category-Level Shape and 6D Pose Estimation from Stereo Images
AeSPa: Attention-guided Self-supervised Parallel imaging for MRI Reconstruction
Scale Efficient Training for Large Datasets
SimAvatar: Simulation-Ready Clothed Gaussian Avatars from Text
BLADE: Single-view Body Mesh Learning through Accurate Depth Estimation
Coherent 3D Portrait Video Reconstruction via Triplane Fusion
Mr. DETR: Multi-Route Training for Detection Transformers with Instructive Self-Attention
Assessing and Learning Alignment of Unimodal Vision and Language Models
VCR: Learning Appearance-Invariant Representation for Open-World Instance Segmentation
Token Cropr: Faster ViTs for Quite a Few Tasks
Koala-36M: A Large-scale Video Dataset Improving Consistency between Fine-grained Conditions and Video Content
NADER: Neural Architecture Design via Multi-Agent Collaboration
Data Synthesis with Diverse Styles for Face Recognition via 3DMM-Guided Diffusion
FedAWA: Adaptive Optimization of Aggregation Weights in Federated Learning Using Client Vectors
ICFace: Identity Code for Face Recognition at Scale
Learning with Noisy Triplet Correspondence for Composed Image Retrieval
Explicit Depth-Aware Blurry Video Frame Interpolation Guided by Differential Curves
Zero-Shot Styled Text Image Generation, but Make It Autoregressive
Distilled Prompt Learning for Incomplete Multimodal Survival Prediction
DRiVE: Diffusion-based Rigging Empowers Generation of Versatile and Expressive Characters
POp-GS: Next Best View in 3D-Gaussian Splatting with P-Optimality
SAM2-LOVE: Segment Anything Model 2 in Language-aided Audio-Visual Scenes
Zero-shot RGB-D Point Cloud Registration with Pre-trained Large Vision Model
HumanDreamer: Generating Controllable Human-Motion Videos via Decoupled Generation
DriveDreamer4D: World Models Are Effective Data Machines for 4D Driving Scene Representation
The Language of Motion: Unifying Verbal and Non-verbal Language of 3D Human Motion
EchoONE: Segmenting Multiple echocardiography Planes in One Model
MVSAnywhere: Zero Shot Multi-View Stereo
From Faces to Voices: Learning Hierarchical Representations for High-quality Video-to-Speech
Playing the Fool: Jailbreaking LLMs and Multimodal LLMs with Out-of-Distribution Strategy
Factored-NeuS: Reconstructing Surfaces, Illumination, and Materials of Possibly Glossy Objects
Using diffusion priors for video amodal segmentation
High Temporal Consistency through Semantic Similarity Propagation in Semi-Supervised Video Semantic Segmentation for Autonomous Flight
MammAlps: A multi-view video dataset of wild mammals behavior monitoring in the Swiss Alps
PICD: Versatile Perceptual Image Compression with Diffusion Rendering
Incomplete Multi-View Multi-label Learning via Disentangled Representation and Label Semantic Embedding
Efficient Decoupled Feature 3D Gaussian Splatting via Hierarchical Compression
Universal Actions for Enhanced Embodied Foundation Models
GBC: Generalizable Gaussian-Based Clothed Human Digitalization
DART: Disease-aware Image-Text Alignment and Self-correcting Re-alignment for Trustworthy Radiology Report Generation
OFER: Occluded Face Expression Reconstruction
Task-Agnostic Guided Feature Expansion for Class-Incremental Learning
Continual SFT Matches Multimodal RLHF with Negative Supervision
Prof. Robot: Differentiable Robot Rendering Without Static and Self-Collisions
HOT: Hadamard-based Optimized Training
PointSR: Self-regularized Point Supervision for Drone-view Object Detection
ClearSight: Visual Signal Enhancement for Object Hallucination Mitigation in Multimodal Large Language Models
ConMo: Controllable Motion Disentanglement and Recomposition for Zero-Shot Motion Transfer
Dynamic Stereotype Theory Induced Micro-expression Recognition with Oriented Deformation
Feature-Preserving Mesh Decimation for Normal Integration
Sketchtopia: A Dataset and Foundational Agents for Benchmarking Asynchronous Multimodal Communication with Iconic Feedback
Towards Million-Scale Adversarial Robustness Evaluation With Stronger Individual Attacks
PersonaBooth: Personalized Text-to-Motion Generation
Point Cloud Upsampling Using Conditional Diffusion Module with Adaptive Noise Suppression
Dr. Splat: Directly Referring 3D Gaussian Splatting via Direct Language Embedding Registration
DAR: Scalable Autoregressive Monocular Depth Estimation
From Words to Structured Visuals: A Benchmark and Framework for Text-to-Diagram Generation and Editing
PCM: Picard Consistency Model for Fast Parallel Sampling of Diffusion Models
OffsetOPT: Explicit Surface Reconstruction without Normals
Revisiting MAE pre-training for 3D medical image segmentation
LOD-GS: Achieving Level of Details using Scalable Gaussian Soup
3D-AVS: LiDAR-based 3D Auto-Vocabulary Segmentation
Realistic Test-Time Adaptation of Vision-Language Models
ICT: Image-Object Cross-Level Trusted Intervention for Mitigating Object Hallucination in Large Vision-Language Models
Consistent and Controllable Image Animation with Motion Diffusion Models
EasyHOI: Unleashing the Power of Large Models for Reconstructing Hand-Object Interactions in the Wild
Novel View Synthesis with Pixel-Space Diffusion Models
DroneSplat: 3D Gaussian Splatting for Robust 3D Reconstruction from In-the-Wild Drone Imagery
CADCrafter: Generating Computer-Aided Design Models from Unconstrained Images
SVDC: Consistent Direct Time-of-Flight Video Depth Completion with Frequency Selective Fusion
Perceptually Accurate 3D Talking Head Generation: New Definitions, Speech-Mesh Representation, and Evaluation Metrics
Enhanced OoD Detection through Cross-Modal Alignment of Multi-Modal Representations
Illumination Spectrum Estimation for Multispectral Images via Surface Reflectance Modeling and Spatial-Spectral Feature Generation
Volumetric Surfaces: Representing Fuzzy Geometries with Layered Meshes
RLAIF-V: Open-Source AI Feedback Leads to Super GPT-4V Trustworthiness
Reloc3r: Large-Scale Training of Relative Camera Pose Regression for Generalizable, Fast, and Accurate Visual Localization
FreeTimeGS: Free Gaussians at Anytime Anywhere for Dynamic Scene Reconstruction
Can Machines Understand Composition? Dataset and Benchmark for Photographic Image Composition Embedding and Understanding
V-Stylist: Video Stylization via Collaboration and Reflection of MLLM Agents
Dense Match Summarization for Faster Two-view Estimation
SLAM3R: Real-Time Dense Scene Reconstruction from Monocular RGB Videos
Segmenting Maxillofacial Structures in CBCT Volume
Diffusion Bridge: Leveraging Diffusion Model to Reduce the Modality Gap Between Text and Vision for Zero-Shot Image Captioning
IceDiff: High Resolution and High-Quality Arctic Sea Ice Forecasting with Generative Diffusion Prior
Focusing on Tracks for Online Multi-Object Tracking
Dual-view X-ray Detection: Can AI Detect Prohibited Items from Dual-view X-ray Images like Humans?
Black Swan: Abductive and Defeasible Video Reasoning in Unexpected Events
Embracing Collaboration Over Competition: Condensing Multiple Prompts for Visual In-Context Learning
CXPMRG-Bench: Pre-training and Benchmarking for X-ray Medical Report Generation on CheXpert Plus Dataset
VerbDiff: Text-Only Diffusion Models with Enhanced Interaction Awareness
Protecting Your Video Content: Disrupting Automated Video-based LLM Annotations
MVPortrait: Text-Guided Motion and Emotion Control for Multi-view Vivid Portrait Animation
Visual Persona: Foundation Model for Full-Body Human Customization
ColabSfM: Collaborative Structure-from-Motion by Point Cloud Registration
Tracktention: Leveraging Point Tracking to Attend Videos Faster and Better
HOT3D: Hand and Object Tracking in 3D from Egocentric Multi-View Videos
SegAgent: Exploring Pixel Understanding Capabilities in MLLMs by Imitating Human Annotator Trajectories
UNICL-SAM: Uncertainty-Driven In-Context Segmentation with Part Prototype Discovery
Bringing CLIP to the Clinic: Dynamic Soft Labels and Negation-Aware Learning for Medical Analysis
AG-VPReID: A Challenging Large-Scale Benchmark for Aerial-Ground Video-based Person Re-Identification
Boost Your Human Image Generation Model via Direct Preference Optimization
ReDiffDet: Rotation-equivariant Diffusion Model for Oriented Object Detection
Towards Precise Embodied Dialogue Localization via Causality Guided Diffusion
SDGOCC: Semantic and Depth-Guided Bird’s-Eye View Transformation for 3D Multimodal Occupancy Prediction
FADA: Fast Diffusion Avatar Synthesis with Mixed-Supervised Multi-CFG Distillation
FeedEdit: Text-Based Image Editing with Dynamic Feedback Regulation
VLOGGER: Multimodal Diffusion for Embodied Avatar Synthesis
PSA-SSL: Pose and Size-aware Self-Supervised Learning on LiDAR Point Clouds
Shadow Generation Using Diffusion Model with Geometry Prior
Rethinking Reconstruction and Denoising in the Dark: New Perspective, General Architecture and Beyond
AniMer: Animal Pose and Shape Estimation Using Family Aware Transformer
All-directional Disparity Estimation for Real-world QPD Images
QMambaBSR: Burst Image Super-Resolution with Query State Space Model
Composing Parts for Expressive Object Generation
Plug-and-Play Interpretable Responsible Text-to-Image Generation via Dual-Space Multi-facet Concept Control
Low-Rank Adaptation in Multilinear Operator Networks for Security-Preserving Incremental Learning
LEDiff: Latent Exposure Diffusion for HDR Generation
Inversion Circle Interpolation: Diffusion-based Image Augmentation for Data-scarce Classification
GUI-Xplore: Empowering Generalizable GUI Agents with One Exploration
SMTPD: A New Benchmark for Temporal Prediction of Social Media Popularity
Beyond Generation: A Diffusion-based Low-level Feature Extractor for Detecting AI-generated Images
Towards Training-free Anomaly Detection with Vision and Language Foundation Models
SURGEON: Memory-Adaptive Fully Test-Time Adaptation via Dynamic Activation Sparsity
Two is Better than One: Efficient Ensemble Defense for Robust and Compact Models
Decision SpikeFormer: Spike-Driven Transformer for Decision Making
ManiVideo: Generating Hand-Object Manipulation Video with Dexterous and Generalizable Grasping
Perception Tokens Enhance Visual Reasoning in Multimodal Language Models
VISTA: Enhancing Long-Duration and High-Resolution Video Understanding by Video Spatiotemporal Augmentation
Dual Consolidation for Pre-Trained Model-Based Domain-Incremental Learning
Positive2Negative: Breaking the Information-Lossy Barrier in Self-Supervised Single Image Denoising
HIIF: Hierarchical Encoding based Implicit Image Function for Continuous Super-resolution
Octopus: Alleviating Hallucination via Dynamic Contrastive Decoding
Noise Calibration and Spatial-Frequency Interactive Network for STEM Image Enhancement
Dragin3D: Image Editing by Dragging in 3D Space
Training-free Dense-Aligned Diffusion Guidance for Modular Conditional Image Synthesis
PosterMaker: Towards High-Quality Product Poster Generation with Accurate Text Rendering
StyleStudio: Text-Driven Style Transfer with Selective Control of Style Elements
NTR-Gaussian: Nighttime Thermal Reconstruction with 4D Gaussian Splatting Based on Thermodynamics
SplatFlow: Self-Supervised Dynamic Gaussian Splatting in Neural Motion Flow Field for Autonomous Driving
XLRS-Bench: Could Your Multimodal LLMs Understand Extremely Large Ultra-High-Resolution Remote Sensing Imagery?
Video-3D LLM: Learning Position-Aware Video Representation for 3D Scene Understanding
Reasoning Mamba: Hypergraph-Guided Region Relation Calculating for Weakly Supervised Affordance Grounding
Robust Multimodal Survival Prediction with the Latent Differentiation Conditional Variational AutoEncoder
EMOVA: Empowering Language Models to See, Hear and Speak with Vivid Emotions
PEER pressure: Model-to-Model Regularization for Single Source Domain Generalization
Classifier-Free Guidance inside the Attraction Basin May Cause Memorization
WeatherGen: A Unified Diverse Weather Generator for LiDAR Point Clouds via Spider Mamba Diffusion
Face Forgery Video Detection via Temporal Forgery Cue Unraveling
Towards Better Alignment: Training Diffusion Models with Reinforcement Learning Against Sparse Rewards
Designing Scale-Wise Transformers for Text-to-Image Synthesis
Towards Continual Universal Segmentation
Effortless Active Labeling for Long-Term Test-Time Adaptation
SeCap: Self-Calibrating and Adaptive Prompts for Cross-view Person Re-Identification in Aerial-Ground Networks
PhD: A ChatGPT-Prompted Visual hallucination Evaluation Dataset
Robustness Analysis: Are Optical Flow Methods Safe to Use?
AniGrad: Anisotropic Gradient-Adaptive Resolution for 3D Reconstruction From Monocular Video
SAM2Object: Consolidating View Consistency via SAM2 for Zero-Shot 3D Instance Segmentation
Diffusion Renderer: Neural Inverse and Forward Rendering with Video Diffusion Models
WildGS-SLAM: Monocular Gaussian Splatting SLAM in Dynamic Environments
Cropper: Vision-Language Model for Image Cropping through In-Context Learning
PatchDEMUX: A Certifiably Robust Framework for Multi-label Classifiers Against Adversarial Patches
Hypergraph Vision Transformers: Images are More than Nodes, More than Edges
Tuning the Frequencies: Robust Training for Sinusoidal Neural Networks
Efficient Event-Based Object Detection: A Hybrid Neural Network with Spatial and Temporal Attention
Efficient Fine-Tuning and Concept Suppression for Pruned Diffusion Models
Nested Diffusion Models using Hierarchical Latent Priors
RainyGS: Efficient Rain Synthesis with Physically-Based Gaussian Splatting
Towards Transformer-Based Aligned Generation with Self-Coherence Guidance
Language-Guided Image Tokenization for Generation
SnapMem: Snapshot-based 3D Scene Memory for Embodied Exploration and Reasoning
Autoregressive Distillation of Diffusion Transformers
Multitwine: Multi-Object Compositing with Text and Layout Control
PAVE: Patching and Adapting Video Large Language Models
Order-One Rolling Shutter Cameras
Re-thinking Temporal Search for Long-Form Video Understanding
A Tale of Two Classes: Adapting Supervised Contrastive Learning to Binary Imbalanced Datasets
ReconDreamer: Crafting World Models for Driving Scene Reconstruction via Online Restoration
RoboGround: Robot Manipulation with Grounded Vision-Language Priors
Towards Human-Understandable Multi-Dimensional Concept Discovery
SpatialCLIP: Learning 3D-aware Image Representations from Spatially Discriminative Language
GenManip: A Simulation Platform for Generalizable TableTop Manipulation in the Era of MLLM
FIction: 4D Future Interaction Prediction from Video
CoSDH: Communication-Efficient Collaborative Perception via Supply-Demand Awareness and Intermediate-Late Hybridization
Rethinking Spiking Self-Attention Mechanism: Implementing α-XNOR Similarity Calculation in Spiking Transformers
CacheQuant: Comprehensively Accelerated Diffusion Models
Learning 4D Panoptic Scene Graph Generation from Rich 2D Visual Scene
HiFi-Portrait: Zero-shot Identity-preserved Portrait Generation with High-fidelity Multi-face Fusion
What’s in the Image? A Deep-Dive into the Vision of Vision Language Models
Deterministic Image-to-Image Translations via Brownian Bridge Denoising Models with Dual Approximators
Localized Concept Erasure for Text-to-Image Diffusion Models Using Training-Free Gated Low-Rank Adaptation
CO-SPY: Combining Semantic and Pixel Features to Detect Synthetic Images by AI
Temporal Alignment-Free Video Matching for Few-shot Action Recognition
Improving Transferable Targeted Attacks with Feature Tuning Mixup
Towards More General Video-based Deepfake Detection through Facial Feature Guided Adaptation for Foundation Model
Floxels: Fast Unsupervised Voxel Based Scene Flow Estimation
Improving Semi-Supervised Semantic Segmentation with Sliced-Wasserstein Feature Alignment and Uniformity
MMAR: Towards Lossless Multi-Modal Auto-Regressive Probabilistic Modeling
HSI-GPT: A General-Purpose Large Scene-Motion-Language Model for Human Scene Interaction
D^3: Scaling Up Deepfake Detection by Learning from Discrepancy
Learning from Neighbors: Category Extrapolation for Long-Tail Learning
DiTCtrl: Exploring Attention Control in Multi-Modal Diffusion Transformer for Tuning-Free Multi-Prompt Longer Video Generation
Overcoming Shortcut Problem in VLM for Robust Out-of-Distribution Detection
Practical solutions to the relative pose of three calibrated cameras
ABC-Former: Auxiliary Bimodal Cross-domain Transformer with Interactive Channel Attention for White Balance
EchoTraffic: Enhancing Traffic Anomaly Understanding with Audio-Visual Insights
SuperLightNet: Lightweight Parameter Aggregation Network for Multimodal Brain Tumor Segmentation
ViiNeuS: Volumetric Initialization for Implicit Neural Surface reconstruction of urban scenes with limited image overlap
Complementary Advantages: Exploiting Cross-Field Frequency Correlation for NIR-Assisted Image Denoising
RL-RC-DoT: A Block-level RL agent for Task-Aware Video Compression
EfficientLLaVA: Generalizable Auto-Pruning for Large Vision-language Models
MobileH2R: Learning Generalizable Human to Mobile Robot Handover Exclusively from Scalable and Diverse Synthetic Data
A Unified Latent Schrödinger Bridge Diffusion Model for Unsupervised Anomaly Detection and Localization
Just Dance with $\pi$! A Poly-modal Inductor for Weakly-supervised Video Anomaly Detection
Apply Hierarchical-Chain-of-Generation to Complex Attributes Text-to-3D Generation
A Unified Framework for Heterogeneous Semi-supervised Learning
One is Plenty: A Polymorphic Feature Interpreter for Immutable Heterogeneous Collaborative Perception
EnliveningGS: Active Locomotion of 3DGS
Activating Sparse Part Concepts for 3D Class Incremental Learning
3DFastEdit: Training-Free Fast and Controllable 3D Editing
Adapting Dense Matching for Homography Estimation with Grid-based Acceleration
MaterialFusion: High-Quality, Zero-Shot, and Controllable Material Transfer with Diffusion Models
ART: Anonymous Region Transformer for Variable Multi-Layer Transparent Image Generation
DeSplat: Decomposed Gaussian Splatting for Distractor-Free Rendering
ProReflow: Progressive Reflow with Decomposed Velocity
Generalized Zero-Shot Classification via Semantics-Free Inter-Class Feature Generation
Chat-based Person Retrieval via Dialogue-Refined Cross-Modal Alignment
A Selective Re-learning Mechanism for Hyperspectral Fusion Imaging
Curriculum Coarse-to-Fine Selection for High-IPC Dataset Distillation
Cross-Rejective Open-Set SAR Image Registration
R2C: Mapping Room to Chessboard to Unlock LLM As Low-Level Action Planner
VI$^3$NR: Variance Informed Initialization for Implicit Neural Representations
DFM: Differentiable Feature Matching for Anomaly Detection
SPA-VL: A Comprehensive Safety Preference Alignment Dataset for Vision Language Model
CustAny: Customizing Anything from A Single Example
ReCap: Better Gaussian Relighting with Cross-Environment Captures
MobilePortrait: Real-Time One-Shot Neural Head Avatars on Mobile Devices
Deterministic Certification of Graph Neural Networks against Poisoning Attacks with Arbitrary Perturbations
Controllable Dynamic 3D Scene Editing with Personalized 2D Diffusion
Adv-CPG: A Customized Portrait Generation Framework with Facial Adversarial Attacks
Charm: The Missing Piece in ViT fine-tuning for Image Aesthetic Assessment
Decoupled Motion Expression Video Segmentation
Geometry in Style: 3D Stylization via Surface Normal Deformation
You See it, You Got it: Learning 3D Creation on Pose-Free Videos at Scale
OpenHumanVid: A Large-Scale High-Quality Dataset for Enhancing Human-Centric Video Generation
Detection-Friendly Nonuniformity Correction: A Union Framework for Infrared UAV Target Detection
DiG: Scalable and Efficient Diffusion Models with Gated Linear Attention
L-SWAG: Layer-Sample Wise Activation with Gradients information for Zero-Shot NAS on Vision Transformers
ODA-GAN: Orthogonal Decoupling Alignment GAN Assisted by Weakly-supervised Learning for Virtual Immunohistochemistry Staining
CaricatureBooth: Data-Free Interactive Caricature Generation in a Photo Booth
Highly Dynamic and Realistic Portrait Image Animation with Diffusion Transformer Networks
FRAME: Floor-aligned Representation for Avatar Motion from Egocentric Video
Pursuing Temporal-Consistent Video Virtual Try-On via Dynamic Pose Interaction
CoMatcher: Multi-View Collaborative Feature Matching
Occlusion-aware Text-Image-Point Cloud Pretraining for Open-World 3D Object Recognition
VideoGEM: Training-free Action Grounding in Videos
LamRA: Large Multimodal Model as Your Advanced Retrieval Assistant
EmoDubber: Towards High Quality and Emotion Controllable Movie Dubbing
AFL: A Single-Round Analytic Approach for Federated Learning with Pre-trained Models
Learning Affine Correspondences by Integrating Geometric Constraints
RCP-Bench: Benchmarking Robustness for Collaborative Perception Under Diverse Corruptions
DepthCues: Evaluating Monocular Depth Perception in Large Vision Models
MIRE: Matched Implicit Neural Representations
SINR: Sparsity Driven Compressed Implicit Neural Representations
Spotting the Unexpected (STU): A 3D LiDAR Dataset for Anomaly Segmentation in Autonomous Driving
Volumetrically Consistent 3D Gaussian Rasterization
ReCon: Enhancing True Correspondence Discrimination through Relation Consistency for Robust Noisy Correspondence Learning
SpiritSight Agent: Advanced GUI Agent with One Look
Tra-MoE: Learning Trajectory Prediction Model from Multiple Domains for Adaptive Policy Conditioning
Omni-Scene: Omni-Gaussian Representation for Ego-Centric Sparse-View Scene Reconstruction
SDBF: Steep-Decision-Boundary Fingerprinting for Hard-Label Tampering Detection of DNN Models
Efficient Dynamic Scene Editing via 4D Gaussian-based Static-Dynamic Separation
Enhancing Adversarial Transferability with Checkpoints of a Single Model’s Training
MESC-3D: Mining Effective Semantic Cues for 3D Reconstruction from a Single Image
Enhancing Creative Generation on Stable Diffusion-based Models
The Change You Want To Detect: Semantic Change Detection In Earth Observation With Hybrid Data Generation
Synchronized Video-to-Audio Generation via Mel Quantization-Continuum Decomposition
Continuous Space-Time Video Resampling with Invertible Motion Steganography
Deformable Radial Kernel Splatting
Revisiting Fairness in Multitask Learning: A Performance-Driven Approach for Variance Reduction
Multi-modal Contrastive Learning with Negative Sampling Calibration for Phenotypic Drug Discovery
Enduring, Efficient and Robust Trajectory Prediction Attack in Autonomous Driving via Optimization-Driven Multi-Frame Perturbation Framework
AeroGen: Enhancing Remote Sensing Object Detection with Diffusion-Driven Data Generation
Attraction Diminishing and Distributing for Few-Shot Class-Incremental Learning
GoalFlow: Goal-Driven Flow Matching for Multimodal Trajectories Generation in End-to-End Autonomous Driving
Accelerating Diffusion Transformer via Increment-Calibrated Caching with Channel-Aware Singular Value Decomposition
DirectTriGS: Triplane-based Gaussian Splatting Field Representation for 3D Generation
Enhancing Privacy-Utility Trade-offs to Mitigate Memorization in Diffusion Models
Balanced Direction from Multifarious Choices: Arithmetic Meta-Learning for Domain Generalization
Commonsense Video Question Answering through Video-Grounded Entailment Tree Reasoning
Mimic In-Context Learning for Multimodal Tasks
Post-pre-training for Modality Alignment in Vision-Language Foundation Models
DeDe: Detecting Backdoor Samples for SSL Encoders via Decoders
All-Day Multi-Camera Multi-Target Tracking
Tora: Trajectory-oriented Diffusion Transformer for Video Generation
ACL: Activating Capability of Linear Attention for Image Restoration
Image Quality Assessment: Investigating Causal Perceptual Effects with Abductive Counterfactual Inference
SKDream: Controllable Multi-view and 3D Generation with Arbitrary Skeletons
Reproducible Vision-Language Models Meet Concepts Out of Pre-Training
Iterative Predictor-Critic Code Decoding for Real-World Image Dehazing
Image Reconstruction from Readout-Multiplexed Single-Photon Detector Arrays
The Devil is in Temporal Token: High Quality Video Reasoning Segmentation
Adventurer: Optimizing Vision Mamba Architecture Designs for Efficiency
Encapsulated Composition of Text-to-Image and Text-to-Video Models for High-Quality Video Synthesis
V2V3D: View-to-View Denoised 3D Reconstruction for Light Field Microscopy
SynthLight: Portrait Relighting with Diffusion Model by Learning to Re-render Synthetic Faces
Visual Representation Learning through Causal Intervention for Controllable Image Editing
HoGS: Unified Near and Far Object Reconstruction via Homogeneous Gaussian Splatting
Text-Driven Fashion Image Editing with Compositional Concept Learning and Counterfactual Abduction
MoManipVLA: Transferring Vision-language-action Models for General Mobile Manipulation
VERA: Explainable Video Anomaly Detection via Verbalized Learning of Vision-Language Models
$Neuron$: Learning Context-Aware Evolving Representations for Zero-Shot Skeleton Action Recognition
One Model for ALL: Low-Level Task Interaction Is a Key to Task-Agnostic Image Fusion
Exposure-slot: Exposure-centric representations learning with Slot-in-Slot Attention for Region-aware Exposure Correction
CL-MoE: Enhancing Multimodal Large Language Model with Dual Momentum Mixture-of-Experts for Continual Visual Question Answering
Separation of powers: On segregating knowledge from observation in LLM-enabled knowledge-based visual question answering
Parameter Efficient Mamba Tuning via Projector-targeted Diagonal-centric Linear Transformation
CoT-VLA: Visual Chain-of-Thought Reasoning for Vision-Language-Action Models
WISH: Weakly Supervised Instance Segmentation using Heterogeneous Labels
4DGC: Rate-Aware 4D Gaussian Compression for Efficient Streamable Free-Viewpoint Video
Do Visual Imaginations Improve Vision-and-Language Navigation Agents?
GaussTR: Foundation Model-Aligned Gaussian Transformer for Self-Supervised 3D Spatial Understanding
Multi-modal Topology-embedded Graph Learning for Spatially Resolved Genes Prediction from Pathology Images with Prior Gene Similarity Information
Dynamic Content Prediction with Motion-aware Priors for Blind Face Video Restoration
Structure from Collision
ReRAW: RAW-from-RGB Image Reconstruction via Stratified Sampling for Efficient Object Detection on the Edge
Linear Attention Modeling for Learned Image Compression
3DGUT: Enabling Distorted Cameras and Secondary Rays in Gaussian Splatting
Building a Mind Palace: Structuring Environment-Grounded Semantic Graphs for Effective Long Video Analysis with LLMs
ASAP: Advancing Semantic Alignment Promotes Multi-Modal Manipulation Detecting and Grounding
UWAV: Uncertainty-weighted Weakly-supervised Audio-Visual Video Parsing
AlphaPre: Amplitude-Phase Disentanglement Model for Precipitation Nowcasting
DAMM-Diffusion: Learning Divergence-Aware Multi-Modal Diffusion Model for Nanoparticles Distribution Prediction
MLVU: Benchmarking Multi-task Long Video Understanding
Test-time augmentation improves efficiency in conformal prediction
High-Fidelity Lightweight Mesh Reconstruction from Point Clouds
OmniGen: Unified Image Generation
Towards Vision Language Models For Extra-Long Video Understanding
Free360: Layered Gaussian Splatting for Unbounded 360-Degree View Synthesis from Extremely Sparse and Unposed Views
Free Lunch Enhancements for Multi-modal Crowd Counting
Wav2Sem: Plug-and-Play Audio Semantic Decoupling for 3D Speech-Driven Facial Animation
Foley-Flow: Coordinated Video-to-Audio Generation with Masked Audio-Visual Alignment and Dynamic Conditional Flows
SemiETS: Integrating Spatial and Content Consistencies for Semi-Supervised End-to-end Text Spotting
FedCS: Coreset Selection for Federated Learning
Unified Medical Lesion Segmentation via Self-referring Indicator
Rethinking Diffusion for Text-Driven Human Motion Generation
Decoder Gradient Shield: Provable and High-Fidelity Prevention of Gradient-Based Box-Free Watermark Removal
SPARS3R: Semantic Prior Alignment and Regularization for Sparse 3D Reconstruction
Mitigating Object Hallucinations in Large Vision-Language Models with Assembly of Global and Local Attention
GeoMM: On Geodesic Perspective for Multi-modal Learning
IncEventGS: Pose-Free Gaussian Splatting from a Single Event Camera
Temporal Separation with Entropy Regularization for Knowledge Distillation in Spiking Neural Networks
Adaptive Unimodal Regulation for Balanced Multimodal Information Acquisition
Narrating the Video: Boosting Text-Video Retrieval via Comprehensive Utilization of Frame-Level Captions
STAA-SNN: Spatial-Temporal Attention Aggregator for Spiking Neural Networks
SemanticDraw: Towards Real-Time Interactive Content Creation from Image Diffusion Models
Efficient stereo depth estimation model for wearable augmented reality devices
Reward Fine-Tuning Two-Step Diffusion Models via Learning Differentiable Latent-Space Surrogate Reward
Enhancing SAM with Efficient Prompting and Preference Optimization for Semi-supervised Medical Image Segmentation
Scene-Centric Unsupervised Panoptic Segmentation
FlexDrive: Toward Trajectory Flexibility in Driving Scene Reconstruction and Rendering
Calibrated Multi-Preference Optimization for Aligning Diffusion Models
Wavelet and Prototype Augmented Query-based Transformer for Pixel-level Surface Defect Detection
Consistency Posterior Sampling for Diverse Image Synthesis
Towards Stable and Storage-efficient Dataset Distillation: Matching Convexified Trajectory
GS-2DGS: Geometrically supervised 2DGS for reflective object reconstruction
Coeff-Tuning: A Filter Subspace View for Tuning Attention-Based Large Models
Symmetry Strikes Back: From Single-Image Symmetry Detection to 3D Generation
DI-PCG: Diffusion-based Efficient Inverse Procedural Content Generation for High-quality 3D Asset Creation
DIO: Decomposable Implicit 4D Occupancy-Flow World Model
StdGEN: Semantic-Decomposed 3D Character Generation from Single Images
DaCapo: Score Distillation as Stacked Bridge for Fast and High-quality 3D Editing
OmniStyle: Filtering High Quality Style Transfer Data at Scale
Flexible Group Count Enables Hassle-Free Structured Pruning
Latent Drifting in Diffusion Models for Counterfactual Medical Image Synthesis
Unified Reconstruction of Static and Dynamic Scenes from Events
SpecTRe-GS: Modeling Highly Specular Surfaces with Reflected Nearby Objects by Tracing Rays in 3D Gaussian Splatting
EigenGS Representation: From Eigenspace to Gaussian Image Space
From Laboratory to Real World: A New Benchmark Towards Privacy-Preserved Visible-Infrared Person Re-Identification
Adapting to the Unknown: Training-Free Audio-Visual Event Perception with Dynamic Thresholds
FLAME: Frozen Large Language Models Enable Data-Efficient Language-Image Pre-training
Learning on Model Weights using Tree Experts
Self-Expansion of Pre-trained Models with Mixture of Adapters for Continual Learning
MMVU: Measuring Expert-Level Multi-Discipline Video Understanding
CholecTrack20: A Multi-Perspective Tracking Dataset for Surgical Tools
Weakly Supervised Semantic Segmentation via Progressive Confidence Region Expansion
EventSplat: 3D Gaussian Splatting from Moving Event Cameras for Real-time Rendering
Language-Assisted Debiasing and Smoothing for Foundation Model-Based Semi-Supervised Learning
MambaOut: Do We Really Need Mamba for Vision?
Closest Neighbors are Harmful for Lightweight Masked Auto-encoders
Plug-and-Play PPO: An Adaptive Point Prompt Optimizer Making SAM Greater
CDI: Copyrighted Data Identification in Diffusion Models
GET: Unlocking the Multi-modal Potential of CLIP for Generalized Category Discovery
VinTAGe: Joint Video and Text Conditioning for Holistic Audio Generation
Learning to Highlight Audio by Watching Movies
Geometry-guided Online 3D Video Synthesis with Multi-View Temporal Consistency
InstanceCap: Improving Text-to-Video Generation via Instance-aware Structured Caption
Enhancing Testing-Time Robustness for Trusted Multi-View Classification in the Wild
Taming Teacher Forcing for Masked Autoregressive Video Generation
Video-MME: The First-Ever Comprehensive Evaluation Benchmark of Multi-modal LLMs in Video Analysis
Plug-and-Play Proximal Restoration Priors for Single-Pixel Imaging
MASt3R-SLAM: Real-Time Dense SLAM with 3D Reconstruction Priors
VidMuse: A Simple Video-to-Music Generation Framework with Long-Short-Term Modeling
DiSRT-In-Bed: Diffusion-Based Sim-to-Real Transfer Framework for In-Bed Human Mesh Recovery
Potential Field based Metric Learning
TopV: Compatible Token Pruning with Inference Time Optimization for Fast and Low-Memory Multimodal Vision Language Model
Targeted Forgetting of Image Subgroups in CLIP Models
Generative Multimodal Pretraining with Discrete Diffusion Timestep Tokens
Light Transport-aware Diffusion Posterior Sampling for Single-View Reconstruction of 3D Volumes
AnyEdit: Mastering Unified High-Quality Image Editing for Any Idea
HuMoCon: Concept Discovery for Human Motion Understanding
Blind-Spot Real-world Image Denoising via Implicit Neural Pixel Resampling
Towards Explainable and Unprecedented Accuracy in Matching Challenging Finger Crease Patterns
Concept Replacer: Replacing Sensitive Concepts in Diffusion Models via Precision Localization
VEU-Bench: Towards Comprehensive Understanding of Video Editing
GenDeg: Diffusion-Based Degradation Synthesis for Generalizable All-in-One Image Restoration
High-quality Point Cloud Oriented Normal Estimation via Hybrid Angular and Euclidean Distance Encoding
ProbPose: A Probabilistic Approach to 2D Human Pose Estimation
DViN: Dynamic Visual Routing Network for Weakly Supervised Referring Expression Comprehension
FSboard: Over 3 million characters of ASL fingerspelling collected via smartphones
Beyond Local Sharpness: Communication-Efficient Global Sharpness-aware Minimization for Federated Learning
DyMO: Training-Free Diffusion Model Alignment with Dynamic Multi-Objective Scheduling
MonoInstance: Enhancing Monocular Priors via Multi-view Instance Alignment for Neural Rendering and Reconstruction
Multi-modal Knowledge Distillation-based Human Trajectory Forecasting
DINOv2 Meets Text: A Unified Framework for Image- and Pixel-Level Vision-Language Alignment
Computational Efficient and Recognition Friendly 3D Point Cloud Privacy Protection
Forensic Self-Descriptions Are All You Need for Zero-Shot Detection, Open-Set Source Attribution, and Clustering of AI-generated Images
DocLayLLM: An Efficient Multi-modal Extension of Large Language Models for Text-rich Document Understanding
Scaling Mesh Generation via Compressive Tokenization
Not All Parameters Matter: Masking Diffusion Models for Enhancing Generation Ability
HiMoR: Monocular Deformable Gaussian Reconstruction with Hierarchical Motion Representation
AniDoc: Animation Creation Made Easier
Towards Understanding How Knowledge Evolves in Large Vision-Language Models
SCSA: A Plug-and-Play Semantic Continuous-Sparse Attention for Arbitrary Semantic Style Transfer
DreamRelation: Bridging Customization and Relation Generation
UniRestore: Unified Perceptual and Task-Oriented Image Restoration Model Using Diffusion Prior
Seurat: From Moving Points to Depth
Exploiting Deblurring Networks for Radiance Fields
Exploring Temporally-Aware Features for Point Tracking
SplatFlow: Multi-View Rectified Flow Model for 3D Gaussian Splatting Synthesis
Boosting the Dual-Stream Architecture in Ultra-High Resolution Segmentation with Resolution-Biased Uncertainty Estimation
An Image-like Diffusion Method for Human-Object Interaction Detection
Unveiling Differences in Generative Models: A Scalable Differential Clustering Approach
GS-DiT: Advancing Video Generation with Pseudo 4D Gaussian Fields through Efficient Dense 3D Point Tracking
Theoretical Insights in Model Inversion Robustness and Conditional Entropy Maximization for Collaborative Inference Systems
Large-scale Multi-view Tensor Clustering with Implicit Linear Kernels
PSHuman: Photorealistic Single-image 3D Human Reconstruction using Cross-Scale Multiview Diffusion and Explicit Remeshing
RelationField: Relate Anything in Radiance Fields
Rethinking Noisy Video-Text Retrieval via Relation-aware Alignment
Real-time Free-view Human Rendering from Sparse-view RGB Videos using Double Unprojected Textures
DrVideo: Document Retrieval Based Long Video Understanding
T2ICount: Enhancing Cross-modal Understanding for Zero-Shot Counting
Color Alignment in Diffusion
Dual Semantic Guidance for Open Vocabulary Semantic Segmentation
BimArt: A Unified Approach for the Synthesis of 3D Bimanual Interaction with Articulated Objects
CheckManual: A New Challenge and Benchmark for Manual-based Appliance Manipulation
LongDiff: Training-Free Long Video Generation in One Go
PQPP: A Joint Benchmark for Text-to-Image Prompt and Query Performance Prediction
Attribute-formed Class-specific Concept Space: Endowing Language Bottleneck Model with Better Interpretability and Scalability
AIGV-Assessor: Benchmarking and Evaluating the Perceptual Quality of Text-to-Video Generation with LMM
Uni4D: Unifying Large Vision Models for 4D Modeling from a Single Video
De²Gaze: Deformable and Decoupled Representation Learning for 3D Gaze Estimation
Blood Flow Speed Estimation with Optical Coherence Tomography Angiography Images
Training-free Video Semantic Segmentation based on Diffusion Models
CroCoDL: Cross-device Collaborative Dataset for Localization
ShapeWords: Guiding Text-to-Image Synthesis with 3D Shape-Aware Prompts
Digital Twin Catalog: A Large-Scale Photorealistic 3D Object Digital Twin Dataset
CoLLM: A Large Language Model for Composed Image Retrieval
Finsler Multi-Dimensional Scaling: Manifold Learning for Asymmetric Dimensionality Reduction and Embedding
IRIS: Inverse Rendering of Indoor Scenes from Low Dynamic Range Images
O-TPT: Orthogonality Constraints for Calibrating Test-time Prompt Tuning in Vision-Language Models
Descriptor-In-Pixel: Point-Feature Tracking For Pixel Processor Arrays
Finer-CAM: Spotting the Difference Reveals Finer Details for Visual Explanation
EarthDial: Turning Multi-sensory Earth Observations to Interactive Dialogues
UIBDiffusion: Universal Imperceptible Backdoor Attack for Diffusion Models
Gaussian Splatting for Efficient Satellite Image Photogrammetry
Seeing more with less: human-like representations in vision models
Rethinking Few-Shot Adaptation of Vision-Language Models in Two Stages
VolFormer: Explore More Comprehensive Cube Interaction for Hyperspectral Image Restoration and Beyond
Not Only Text: Exploring Compositionality of Visual Representations in Vision-Language Models
TIDE: Training Locally Interpretable Domain Generalization Models Enables Test-time Correction
SAIST: Segment Any Infrared Small Target Model Guided by Contrastive Language-Image Pretraining
Prototype-Based Image Prompting for Weakly Supervised Histopathological Image Segmentation
Efficient Data Driven Mixture-of-Expert Extraction from Trained Networks
Sparse Point Cloud Patches Rendering via Splitting 2D Gaussians
Task Preference Optimization: Improving Multimodal Large Language Models Performance with Vision Task Alignment
Real-time High-fidelity Gaussian Human Avatars with Position-based Interpolation of Spatially Distributed MLPs
K-Sort Arena: Efficient and Reliable Benchmarking for Generative Models via K-wise Human Preferences
Generalized Diffusion Detector: Mining Robust Features from Diffusion Models for Domain-Generalized Detection
UNIALIGN: Scaling Multimodal Alignment within One Unified Model
HMAR: Efficient Hierarchical Masked AutoRegressive Image Generation
Towards Context-Stable and Hue-Consistent Image Inpainting
Gradient-Guided Annealing for Domain Generalization
The Devil is in Low-Level Features for Cross-Domain Few-Shot Segmentation
Dynamic Group Normalization: Spatio-Temporal Adaptation to Evolving Data Statistics
Florence-VL: Enhancing Vision-Language Models with Generative Vision Encoder and Depth-Breadth Fusion
Forensics Adapter: Adapting CLIP for Generalizable Face Forgery Detection
Unsupervised Foundation Model-Agnostic Slide-Level Representation Learning
Where the Devil Hides: Deepfake Detectors Can No Longer Be Trusted
Learning to Anticipate Table Tennis Hits from Monocular Video
FreeUV: Ground-Truth-Free Realistic Facial UV Texture Recovery via Cross-Assembly Inference Strategy
Correcting Deviations from Normality: A Reformulated Diffusion Model for Multi-Class Unsupervised Anomaly Detection
Spectral State Space Model for Rotation-Invariant Visual Representation Learning
MultimodalStudio: A Heterogeneous Sensor Dataset and Framework for Neural Rendering across Multiple Imaging Modalities
Relative Pose Estimation through Affine Corrections of Monocular Depth Priors
CLIP is Strong Enough to Fight Back: Test-time Counterattacks towards Zero-shot Adversarial Robustness of CLIP
Recurrent Feature Mining and Keypoint Mixup Padding for Category-Agnostic Pose Estimation
Spectral Informed Mamba for Robust Point Cloud Processing
NoT: Federated Unlearning via Weight Negation
Label Shift Meets Online Learning: Ensuring Consistent Adaptation with Universal Dynamic Regret
PCDreamer: Point Cloud Completion Through Multi-view Diffusion Priors
AesthetiQ: Enhancing Graphic Layout Design via Aesthetic-Aware Preference Alignment of Multi-modal Large Language Models
SaMam: Style-aware State Space Model for Arbitrary Image Style Transfer
Boosting Video Quality Assessment via Saliency-guided Local Perception
Learning Person-Specific Animatable Face Models from In-the-Wild Images via a Shared Base Model
COSMOS: Cross-Modality Self-Distillation for Vision Language Pretraining
POMP: Physics-consistent Human Motion Prior through Phase Manifolds
FLAIR: VLM with Fine-grained Language-informed Image Representations
CMMLoc: Advancing Text-to-PointCloud Localization with Cauchy-Mixture-Model Based Framework
D³CTTA: Domain-Dependent Decorrelation for Continual Test-Time Adaption of 3D LiDAR Segmentation
Deterministic-to-Stochastic Diverse Latent Feature Mapping for Human Motion Synthesis
From Poses to Identity: Training-Free Person Re-Identification via Feature Centralization
MM-OR: A Large Multimodal Operating Room Dataset for Semantic Understanding of High Intensity Surgical Environments
Omnia de EgoTempo: Benchmarking Temporal Understanding of Multi-Modal LLMs in Egocentric Videos
DropoutGS: Dropping Out Gaussians for Better Sparse-view Rendering
Unveil Inversion and Invariance in Flow Transformer for Versatile Image Editing
FluidNexus: 3D Fluid Reconstruction and Prediction from a Single Video
Deep Fair Multi-View Clustering with Attention KAN
Insightful Instance Features for 3D Instance Segmentation
Mitigating Hallucinations in Large Vision-Language Models via DPO: On-Policy Data Hold the Key
Stereo Anywhere: Robust Zero-Shot Deep Stereo Matching Even Where Either Stereo or Mono Fail
AdaDARE-γ: Balancing Stability and Plasticity in Multi-modal LLMs through Efficient Adaptation
Spatiotemporal Decoupling for Efficient Vision-Based Occupancy Forecasting
CARE Transformer: Mobile-Friendly Linear Visual Transformer via Decoupled Dual Interaction
MonoPlace3D: Learning 3D-Aware Object Placement for 3D Monocular Detection
Prior-free 3D Object Tracking
Ges3ViG: Incorporating Pointing Gestures into Language-Based 3D Visual Grounding for Embodied Reference Understanding
Unsupervised Continual Domain Shift Learning with Multi-Prototype Modeling
SnowMaster: Comprehensive Real-world Image Desnowing via MLLM with Multi-Model Feedback Optimization
PersonaHOI: Effortlessly Improving Personalized Face with Human-Object Interaction Generation
Embodied Scene Understanding for Vision Language Models via MetaVQA
Feature Selection for Latent Factor Models
Vision-Guided Action: Enhancing 3D Human Motion Prediction with Gaze-informed Affordance in 3D Scenes
IM-Portrait: Learning 3D-aware Video Diffusion for Photorealistic Talking Heads from Monocular Videos
High-fidelity 3D Object Generation from Single Image with RGBN-Volume Gaussian Reconstruction Model
Libra-Merging: Importance-redundancy and Pruning-merging Trade-off for Acceleration Plug-in in Large Vision-Language Model
Learning Class Prototype for Unified Sparse Supervised 3D Object Detection
GaussianUDF: Inferring Unsigned Distance Functions through 3D Gaussian Splatting
EquiPose: Exploiting Equivariance for Relative Camera Pose Estimation
Synthetic Data is an Elegant GIFT for Continual Vision-Language Models
Efficient Video Face Enhancement with Enhanced Spatial-Temporal Consistency
PartRM: Modeling Part-Level Dynamics with Large 4D Reconstruction Model
LMO: Linear Mamba Operator for MRI Reconstruction
RigGS: Rigging of 3D Gaussians for Modeling Articulated Objects in Videos
Align-A-Video: Deterministic Reward Tuning of Image Diffusion Models for Consistent Video Editing
Code-as-Monitor: Constraint-aware Visual Programming for Reactive and Proactive Robotic Failure Detection
PhysVLM: Enabling Visual Language Models to Understand Robotic Physical Reachability
MITracker: Multi-View Integration for Visual Object Tracking
Stochastic Human Motion Prediction with Memory of Action Transition and Action Characteristic
TaoAvatar: Real-Time Lifelike Full-Body Talking Avatars for Augmented Reality via 3D Gaussian Splatting
Inference-Scale Complexity in ANN-SNN Conversion for High-Performance and Low-Power Applications
3D Dental Model Segmentation with Geometrical Boundary Preserving
USP-Gaussian: Unifying Spike-based Image Reconstruction, Pose Correction and Gaussian Splatting
SelfSplat: Pose-Free and 3D Prior-Free Generalizable 3D Gaussian Splatting
DiGIT: Multi-Dilated Gated Encoder and Central-Adjacent Region Integrated Decoder for Temporal Action Detection Transformer
End-to-End Implicit Neural Representations for Classification
MINIMA: Modality Invariant Image Matching
DocSAM: Unified Document Image Segmentation via Query Decomposition and Heterogeneous Mixed Learning
UniPose: A Unified MultiModal Framework for Human Pose Comprehension, Generation and Editing
MExD: An Expert-Infused Diffusion Model for Whole-Slide Image Classification
VELOCITI: Benchmarking Video-Language Compositional Reasoning with Strict Entailment
A General Adaptive Dual-level Weighting Mechanism for Remote Sensing Pansharpening
Vision-Language Embodiment for Monocular Depth Estimation
MambaVision: A Hybrid Mamba-Transformer Vision Backbone
Generating a Five-Second Video within Five Seconds on a Mobile Device
DefMamba: Deformable Visual State Space Model
Towards Practical Real-Time Neural Video Compression
AnomalyNCD: Towards Novel Anomaly Class Discovery in Industrial Scenarios
SoftShadow: Leveraging Soft Masks for Penumbra-Aware Shadow Removal
Decoupling Fine Detail and Global Geometry for Compressed Depth Map Super-Resolution
CADDreamer: CAD Object Generation from Single-view Images
ABBSPO: Adaptive Bounding Box Scaling and Symmetric Prior based Orientation Prediction for Detecting Aerial Image Objects
SemiDAViL: Semi-supervised Domain Adaptation with Vision-Language Guidance for Semantic Segmentation
CASP: Consistency-aware Audio-induced Saliency Prediction Model for Omnidirectional Video
Towards Universal AI-Generated Image Detection by Variational Information Bottleneck Network
UMotion: Uncertainty-driven Human Motion Estimation from Inertial and Ultra-wideband Units
Bridging Modalities: Improving Universal Multimodal Retrieval by Multimodal Large Language Models
Towards Scalable Human-aligned Benchmark for Text-guided Image Editing
High-Fidelity Relightable Monocular Portrait Animation with Lighting-Controllable Video Diffusion Model
Resilient Sensor Fusion under Adverse Sensor Failures via Multi-Modal Expert Fusion
Object-aware Sound Source Localization via Audio-Visual Scene Understanding
Edit Away and My Face Will not Stay: Personal Biometric Defense against Malicious Generative Editing
Gaze-LLE: Gaze Target Estimation via Large-Scale Learned Encoders
Question-Aware Gaussian Experts for Audio-Visual Question Answering
SocialGesture: Delving into Multi-person Gesture Understanding
Plug-and-Play Versatile Compressed Video Enhancement
Unleashing In-context Learning of Autoregressive Models for Few-shot Image Manipulation
Prometheus: 3D-Aware Latent Diffusion Models for Feed-Forward Text-to-3D Scene Generation
NeighborRetr: Balancing Hub Centrality in Cross-Modal Retrieval
Single Domain Generalization for Few-Shot Counting via Universal Representation Matching
nnWNet: Rethinking the Use of Transformers in Biomedical Image Segmentation and Calling for a Unified Evaluation Benchmark
Learning Temporally Consistent Video Depth from Video Diffusion Priors
Style Evolving along Chain-of-Thought for Unknown-Domain Object Detection
Detect Any Mirrors: Boosting Learning Reliability on Large-Scale Unlabeled Data with an Iterative Data Engine
Synergizing Motion and Appearance: Multi-Scale Compensatory Codebooks for Talking Head Video Generation
Visual Consensus Prompting for Co-Salient Object Detection
Motion-Grounded Video Reasoning: Understanding and Perceiving Motion at Pixel Level
Towards In-the-wild 3D Plane Reconstruction from a Single Image
Binarized Semantic Mamba-Transformer for Lightweight Quad Bayer HybridEVS Demosaicing
Exploiting Temporal State Space Sharing for Video Semantic Segmentation
Model Poisoning Attacks to Federated Learning via Multi-Round Consistency
ExpertAF: Expert Actionable Feedback from Video
Event-based Video Super-Resolution via State Space Models
MetricGrids: Arbitrary Nonlinear Approximation with Elementary Metric Grids based Implicit Neural Representation
LesionLocator: Zero-Shot Universal Tumor Segmentation and Tracking in 3D Whole-Body Imaging
DA-VPT: Semantic-Guided Visual Prompt Tuning for Vision Transformers
Semi-Supervised State-Space Model with Dynamic Stacking Filter for Real-World Video Deraining
Make It Count: Text-to-Image Generation with an Accurate Number of Objects
VISTREAM: Improving Computation Efficiency of Visual Perception Streaming via Law-of-Charge-Conservation Inspired Spiking Neural Network
MCCD: Multi-Agent Collaboration-based Compositional Diffusion for Complex Text-to-Image Generation
Spiking Transformer with Spatial-Temporal Attention
Patch Matters: Training-free Fine-grained Image Caption Enhancement via Local Perception
Mask-Adapter: The Devil is in the Masks for Open-Vocabulary Segmentation
Hierarchical Adaptive Filtering Network for Text Image Specular Highlight Removal
TailedCore: Few-Shot Sampling for Unsupervised Long-Tail Noisy Anomaly Detection
PIDLoc: Cross-View Pose Optimization Network Inspired by PID Controllers
VinaBench: Benchmark for Faithful and Consistent Visual Narratives
PMA: Towards Parameter-Efficient Point Cloud Understanding via Point Mamba Adapter
Adapting Pre-trained 3D Models for Point Cloud Video Understanding via Cross-frame Spatio-temporal Perception
Co-op: Correspondence-based Novel Object Pose Estimation
Mamba-Adaptor: State Space Model Adaptor for Visual Recognition
Generative Sparse-View Gaussian Splatting
Shift the Lens: Environment-Aware Unsupervised Camouflaged Object Detection
Exact: Exploring Space-Time Perceptive Clues for Weakly Supervised Satellite Image Time Series Semantic Segmentation
MG-MotionLLM: A Unified Framework for Motion Comprehension and Generation across Multiple Granularities
GRAE-3DMOT: Geometry Relation-Aware Encoder for Online 3D Multi-Object Tracking
Efficient ANN-Guided Distillation: Aligning Rate-based Features of Spiking Neural Networks through Hybrid Block-wise Replacement
SOAP: Vision-Centric 3D Semantic Scene Completion with Scene-Adaptive Decoder and Occluded Region-Aware View Projection
WeakMCN: Multi-task Collaborative Network for Weakly Supervised Referring Expression Comprehension and Segmentation
RoboBrain: A Unified Brain Model for Robotic Manipulation from Abstract to Concrete
MoDec-GS: Global-to-Local Motion Decomposition and Temporal Interval Adjustment for Compact Dynamic 3D Gaussian Splatting
MLLM-as-a-Judge for Image Safety without Human Labeling
Inverting Flow for Image Restoration
STEPS: Sequential Probability Tensor Estimation for Text-to-Image Hard Prompt Search
Generative Hard Example Augmentation for Semantic Point Cloud Segmentation
Certified Human Trajectory Prediction
Cheb-GR: Rethinking k-nearest neighbor search in Re-ranking for Person Re-identification
A Lightweight UDF Learning Framework for 3D Reconstruction Based on Local Shape Functions
PI-HMR: Towards Robust In-bed Temporal Human Shape Reconstruction with Contact Pressure Sensing
OmniDocBench: Benchmarking Diverse PDF Document Parsing with Comprehensive Annotations
Image Over Text: Transforming Formula Recognition Evaluation with Character Detection Matching
Anatomical Consistency and Adaptive Prior-informed Transformation for Multi-contrast MR Image Synthesis via Diffusion Model
Improve Representation for Imbalanced Regression through Geometric Constraints
Auto-Encoded Supervision for Perceptual Image Super-Resolution
Enhancing Vision-Language Compositional Understanding with Multimodal Synthetic Data
Derivative-Free Diffusion Manifold-Constrained Gradient for Unified XAI
MICAS: Multi-grained In-Context Adaptive Sampling for 3D Point Cloud Processing
Hybrid Reciprocal Transformer with Triplet Feature Alignment for Scene Graph Generation
DPSeg: Dual-Prompt Cost Volume Learning for Open-Vocabulary Semantic Segmentation
Adapter Merging with Centroid Prototype Mapping for Scalable Class-Incremental Learning
Towards Unbiased and Robust Spatio-Temporal Scene Graph Generation and Anticipation
Debiasing Multimodal Large Language Models via Noise-Aware Preference Optimization
Real-IAD D³: A Real-World 2D/Pseudo-3D/3D Dataset for Industrial Anomaly Detection
VISCO: Benchmarking Fine-Grained Critique and Correction Towards Self-Improvement in Visual Reasoning
AMR-Transformer: Enabling Efficient Long-range Interaction for Complex Neural Fluid Simulation
Enhancing Facial Privacy Protection via Weakening Diffusion Purification
VideoScene: Distilling Video Diffusion Model to Generate 3D Scenes in One Step
Acc3D: Accelerating Single Image to 3D Diffusion Models via Edge Consistency Guided Score Distillation
VideoGuide: Improving Video Diffusion Models without Training Through a Teacher's Guide
Optical-Flow Guided Prompt Optimization for Coherent Video Generation
Video Motion Transfer with Diffusion Transformers
Self-Learning Hyperspectral and Multispectral Image Fusion via Adaptive Residual Guided Subspace Diffusion Model
DL2G: Degradation-guided Local-to-Global Restoration for Eyeglass Reflection Removal
Grounding 3D Object Affordance with Language Instructions, Visual Observations and Interactions
Bridging the Gap between Diffusion Models and Universal Quantization for Image Compression
Recurrence-Enhanced Vision-and-Language Transformers for Robust Multimodal Document Retrieval
Shape Abstraction via Marching Differentiable Support Functions
Augmenting Multimodal LLMs with Self-Reflective Tokens for Knowledge-based Visual Question Answering
Point2RBox-v2: Rethinking Point-supervised Oriented Object Detection with Spatial Layout Among Instances
AnyDressing: Customizable Multi-Garment Virtual Dressing via Latent Diffusion Models
EdgeMovingNet: Edge-preserving Point Cloud Reconstruction via Joint Geometry Features
INFP: Audio-Driven Interactive Head Generation in Dyadic Conversations
Stealthy Backdoor Attack in Self-Supervised Learning Vision Encoders for Large Vision Language Models
SAMWISE: Infusing wisdom in SAM2 for Text-Driven Video Segmentation
Generative Densification: Learning to Densify Gaussians for High-Fidelity Generalizable 3D Reconstruction
Asynchronous Collaborative Graph Representation for Frames and Events
MaRI: Material Retrieval Integration across Domains
Fine-Grained Image-Text Correspondence with Cost Aggregation for Open-Vocabulary Part Segmentation
Concept Lancet: Representation Decomposition and Transplant for Diffusion-Based Image Editing
Invisible Backdoor Attack against Self-supervised Learning
Teller: Real-Time Streaming Audio-Driven Portrait Animation with Autoregressive Motion Generation
Kiss3DGen: Repurposing Image Diffusion Models for 3D Asset Generation
EEE-Bench: A Comprehensive Multimodal Electrical And Electronics Engineering Benchmark
NeRFPrior: Learning Neural Radiance Field as a Prior for Indoor Scene Reconstruction
BrepGiff: Lightweight Generation of Complex B-rep with 3D GAT Diffusion
StarVector: Generating Scalable Vector Graphics Code from Images and Text
Universal Domain Adaptation for Semantic Segmentation
HVI: A New color space for Low-light Image Enhancement
GraphMimic: Graph-to-Graphs Generative Modeling from Videos for Policy Learning
Human Motion Instruction Tuning
Classifier-guided CLIP Distillation for Unsupervised Multi-label Classification
CAP4D: Creating Animatable 4D Portrait Avatars with Morphable Multi-View Diffusion Models
LIM: Large Interpolator Model for Dynamic Reconstruction
IDEA: Inverted Text with Cooperative Deformable Aggregation for Multi-modal Object Re-Identification
HyperGLM: HyperGraph for Video Scene Graph Generation and Anticipation
EntityErasure: Erasing Entity Cleanly via Amodal Entity Segmentation and Completion
Advancing Generalizable Tumor Segmentation with Anomaly-Aware Open-Vocabulary Attention Maps and Frozen Foundation Diffusion Models
VDocRAG: Retrieval-Augmented Generation over Visually-Rich Documents
Reasoning in visual navigation of end-to-end trained agents: a dynamical systems approach
FFR: Frequency Feature Rectification for Weakly Supervised Semantic Segmentation
Devils in Middle Layers of Large Vision-Language Models: Interpreting, Detecting and Mitigating Object Hallucinations via Attention Lens
Mitigating Ambiguities in 3D Classification with Gaussian Splatting
Minimizing Labeled, Maximizing Unlabeled: An Image-Driven Approach for Video Instance Segmentation
SparseAlign: a Fully Sparse Framework for Cooperative Object Detection
Robust-MVTON: Learning Cross-Pose Feature Alignment and Fusion for Robust Multi-View Virtual Try-On
FDS: Frequency-Aware Denoising Score for Text-Guided Latent Diffusion Image Editing
DropGaussian: Structural Regularization for Sparse-view Gaussian Splatting
Generating Multimodal Driving Scenes via Next-Scene Prediction
HaWoR: World-Space Hand Motion Reconstruction from Egocentric Videos
ConText-CIR: Learning from Concepts in Text for Composed Image Retrieval
Learning to Sample Effective and Diverse Prompts for Text-to-Image Generation
Dataset Distillation with Neural Characteristic Function: A Minmax Perspective
STING-BEE: Towards Vision-Language Model for Real-World X-ray Baggage Security Inspection
Effective SAM Combination for Open-Vocabulary Semantic Segmentation
Rectified Diffusion Guidance for Conditional Generation
Lift3D Policy: Lifting 2D Foundation Models for Robust 3D Robotic Manipulation
EntropyMark: Towards More Harmless Backdoor Watermark via Entropy-based Constraint for Open-source Dataset Copyright Protection
SGFormer: Satellite-Ground Fusion for 3D Semantic Scene Completion
Object-Centric Prompt-Driven Vision-Language-Action Model for Robotic Manipulation
Simplification Is All You Need against Out-of-Distribution Overconfidence
Is 'Right' Right? Enhancing Object Orientation Understanding in Multimodal Language Models through Egocentric Instruction Tuning
Towards Efficient Foundation Model for Zero-shot Amodal Segmentation
Distinguish Then Exploit: Source-free Open Set Domain Adaptation via Weight Barcode Estimation and Sparse Label Assignment
Camouflage Anything: Learning to Hide using Controlled Out-painting and Representation Engineering
Improving Diffusion Inverse Problem Solving with Decoupled Noise Annealing
Let's Verify and Reinforce Image Generation Step by Step
Random Conditioning for Diffusion Model Compression with Distillation
From Slow Bidirectional to Fast Causal Video Generator
U-Know-DiffPAN: An Uncertainty-aware Knowledge Distillation Diffusion Framework with Details Enhancement for PAN-Sharpening
Percept, Memory, and Imagine: World Feature Simulating for Open-Domain Unknown Object Detection
Unleashing the Potential of Multi-modal Foundation Models and Video Diffusion for 4D Dynamic Physical Scene Simulation
Seeing 3D World in A Grain of Sand
Chain of Semantics Programming in 3D Gaussian Splatting Representation for 3D Vision Grounding
Precise, Fast, and Low-cost Concept Erasure in Value Space: Orthogonal Complement Matters
EvOcc: Accurate Semantic Occupancy for Automated Driving Using Evidence Theory
NeISF++: Neural Incident Stokes Field for Polarized Inverse Rendering of Conductors and Dielectrics
Uncertainty Weighted Gradients for Model Calibration
MPDrive: Improving Spatial Understanding with Marker-Based Prompt Learning for Autonomous Driving
FreeScene: Mixed Graph Diffusion for 3D Scene Synthesis from Free Prompts
Pos3R: 6D Pose Estimation for Unseen Objects Made Easy
One-shot 3D Object Canonicalization based on Geometric and Semantic Consistency
MonoSplat: Generalizable 3D Gaussian Splatting from Monocular Depth Foundation Models
EDCFlow: Exploring Temporally Dense Difference Maps for Event-based Optical Flow Estimation
Three Cars Approaching within 100m! Enhancing Distant Geometry by Tri-Axis Voxel Scanning for Camera-based Semantic Scene Completion
Incorporating Dense Knowledge Alignment into Unified Multimodal Representation Models
Federated Learning with Domain Shift Eraser
Robotic Visual Instruction
Infighting in the Dark: Multi-Label Backdoor Attack in Federated Learning
H2ST: Hierarchical Two-Sample Tests for Continual Out-of-Distribution Detection
Robust Audio-Visual Segmentation via Audio-Guided Visual Convergent Alignment
G3Flow: Generative 3D Semantic Flow for Pose-aware and Generalizable Object Manipulation
Enhancing Dance-to-Music Generation via Negative Conditioning Latent Diffusion Model
SoRA: Singular Value Decomposed Low-Rank Adaptation for Domain Generalizable Representation Learning
SATA: Spatial Autocorrelation Token Analysis for Enhancing the Robustness of Vision Transformers
Molmo and PixMo: Open Weights and Open Data for State-of-the-Art Vision-Language Models
Video-Panda: Parameter-efficient Alignment for Encoder-free Video-Language Models
Tartan IMU: A Light Foundation Model for Inertial Positioning in Robotics
The Impact Label Noise and Choice of Threshold has on Cross-Entropy and Soft-Dice in Image Segmentation
GenAssets: Generating in-the-wild 3D Assets in Latent Space
MoST: Efficient Monarch Sparse Tuning for 3D Representation Learning
Large Self-Supervised Models Bridge the Gap in Domain Adaptive Object Detection
Repurposing Pre-trained Video Diffusion Models for Event-based Video Interpolation
SASep: Saliency-Aware Structured Separation of Geometry and Feature for Open Set Learning on Point Clouds
CamFreeDiff: Camera-free Image to Panorama Generation with Diffusion Model
Fancy123: One Image to High-Quality 3D Mesh Generation via Plug-and-Play Deformation
UniGraspTransformer: Simplified Policy Distillation for Scalable Dexterous Robotic Grasping
StreamingT2V: Consistent, Dynamic, and Extendable Long Video Generation from Text
ScribbleLight: Single Image Indoor Relighting with Scribbles
MambaIC: State Space Models for High-Performance Learned Image Compression
Continuous Adverse Weather Removal via Degradation-Aware Distillation
UCM-VeID V2: A Richer Dataset and A Pre-training Method for UAV Cross-Modality Vehicle Re-Identification
PDFactor: Learning Tri-Perspective View Policy Diffusion Field for Multi-Task Robotic Manipulation
EIDT-V: Exploiting Intersections in Diffusion Trajectories for Model-Agnostic, Zero-Shot, Training-Free Text-to-Video Generation
VideoDirector: Precise Video Editing via Text-to-Video Models
Active Data Curation Effectively Distills Large-Scale Multimodal Models
SeeGround: See and Ground for Zero-shot Open-Vocabulary 3D Visual Grounding
COSMIC: Clique-Oriented Semantic Multi-space Integration for Robust CLIP Test-Time Adaptation
Self-supervised ControlNet with Spatio-Temporal Mamba for Real-world Video Super-resolution
Detecting Open World Objects via Partial Attribute Assignment
Revealing Key Details to See Differences: A Novel Prototypical Perspective for Skeleton-based Action Recognition
ShowMak3r: Compositional TV Show Reconstruction
Contextual AD Narration with Interleaved Multimodal Sequence
Neuro-3D: Towards 3D Visual Decoding from EEG Signals
Improving Editability in Image Generation with Layer-wise Memory
CarPlanner: Consistent Auto-regressive Trajectory Planning for Large-Scale Reinforcement Learning in Autonomous Driving
AdaMMS: Model Merging for Heterogeneous Multimodal Large Language Models with Unsupervised Coefficient Optimization
FIRE: Robust Detection of Diffusion-Generated Images via Frequency-Guided Reconstruction Error
Recovering Dynamic 3D Sketches from Videos
Hyperbolic Uncertainty-Aware Few-Shot Incremental Point Cloud Segmentation
Semantic and Sequential Alignment for Referring Video Object Segmentation
Beyond Image Classification: A Video Benchmark and Dual-Branch Hybrid Discrimination Framework for Compositional Zero-Shot Learning
STCOcc: Sparse Spatial-Temporal Cascade Renovation for 3D Occupancy and Scene Flow Prediction
When Domain Generalization meets Generalized Category Discovery: An Adaptive Task-Arithmetic Driven Approach
Convex Relaxation for Robust Vanishing Point Estimation in Manhattan World
Deep Change Monitoring: A Hyperbolic Representative Learning Framework and a Dataset for Long-term Fine-grained Tree Change Detection
Reconstructing In-the-Wild Open-Vocabulary Human-Object Interactions
OSLoPrompt: Bridging Low-Supervision Challenges and Open-Set Domain Generalization in CLIP
Dual-Agent Optimization framework for Cross-Domain Few-Shot Segmentation
Leveraging SD Map to Augment HD Map-based Trajectory Prediction
Variance-Based Membership Inference Attacks Against Large-Scale Image Captioning Models
CraftsMan: High-fidelity Mesh Generation with 3D Native Generation and Interactive Geometry Refiner
Action Detail Matters: Refining Video Recognition with Local Action Queries
Learning Physics-Based Full-Body Human Reaching and Grasping from Brief Walking References
Devil is in the Detail: Towards Injecting Fine Details of Image Prompt in Image Generation via Conflict-free Guidance and Stratified Attention
Hyperbolic Category Discovery
Pioneering 4-Bit FP Quantization for Diffusion Models: Mixup-Sign Quantization and Timestep-Aware Fine-Tuning
Dynamic Derivation and Elimination: Audio Visual Segmentation with Enhanced Audio Semantics
ShiftwiseConv: Small Convolutional Kernel with Large Kernel Effect
ScaMo: Exploring the Scaling Law in Autoregressive Motion Generation Model
Transformers without Normalization
DesignDiffusion: High-Quality Text-to-Design Image Generation with Diffusion Models
Learning to Filter Outlier Edges in Global SfM
Notes-guided MLLM Reasoning: Enhancing MLLM with Knowledge and Visual Notes for Visual Question Answering
VideoMage: Multi-Subject and Motion Customization of Text-to-Video Diffusion Models
MFogHub: Bridging Multi-Regional and Multi-Satellite Data for Global Marine Fog Detection and Forecasting
Progressive Focused Transformer for Single Image Super-Resolution
Parameter-efficient Fine-tuning in Hyperspherical Space for Open-vocabulary Semantic Segmentation
Understanding Fine-tuning CLIP for Open-vocabulary Semantic Segmentation in Hyperbolic Space
DeClotH: Decomposable 3D Cloth and Human Body Reconstruction from a Single Image
GenPC: Zero-shot Point Cloud Completion via 3D Generative Priors
OralXrays-9: Towards Hospital-Scale Panoramic X-ray Anomaly Detection via Personalized Multi-Object Query-Aware Mining
Believing is Seeing: Unobserved Object Detection using Generative Models
HomoGen: Enhanced Video Inpainting via Homography Propagation and Diffusion
Unsupervised Discovery of Facial Landmarks and Head Pose
Neural Video Compression with Context Modulation
Three-view Focal Length Recovery From Homographies
Augmented Deep Contexts for Spatially Embedded Video Coding
SerialGen: Personalized Image Generation by First Standardization Then Personalization
StableAnimator: High-Quality Identity-Preserving Human Image Animation
Joint Optimization of Neural Radiance Fields and Continuous Camera Motion from a Monocular Video
Hyperspectral Pansharpening via Diffusion Models with Iteratively Zero-Shot Guidance
Janus: Decoupling Visual Encoding for Unified Multimodal Understanding and Generation
3D-MVP: 3D Multiview Pretraining for Robotic Manipulation
JanusFlow: Harmonizing Autoregression and Rectified Flow for Unified Multimodal Understanding and Generation
Fitted Neural Lossless Image Compression
MagicQuill: An Intelligent Interactive Image Editing System
Seeing Far and Clearly: Mitigating Hallucinations in MLLMs with Attention Causal Decoding
FlashSloth: Lightning Multimodal Large Language Models via Embedded Visual Compression
DyCON: Dynamic Uncertainty-aware Consistency and Contrastive Learning for Semi-supervised Medical Image Segmentation
LLaVA-Critic: Learning to Evaluate Multimodal Models
Concept Preservation and Unbinding in Continual Diffusion Customization
VLsI: Verbalized Layers-to-Interactions from Large to Small Vision Language Models
MV-MATH: Evaluating Multimodal Math Reasoning in Multi-Visual Contexts
Beyond Sight: Towards Cognitive Alignment in LVLM via Enriched Visual Knowledge
OmniMMI: A Comprehensive Multi-modal Interaction Benchmark in Streaming Video Contexts
Can't Slow me Down: Learning Robust and Hardware-Adaptive Object Detectors against Latency Attacks for Edge Devices
SceneTAP: Scene-Coherent Typographic Adversarial Planner against Vision-Language Models in Real-World Environments
DiffVsgg: Diffusion-based Online Video Scene Graph Generation
Link to the Past: Temporal Propagation for Fast 3D Human Reconstruction from Monocular Video
A Polarization-aided Transformer for Image Deblurring via Motion Vector Decomposition
Probabilistic Prompt Distribution Learning for Animal Pose Estimation
EventPSR: Surface Normal and Reflectance Estimation from Photometric Stereo Using an Event Camera
$D^{3}$-Human: Dynamic Disentangled Digital Human from Monocular Video
Layered Image Vectorization via Semantic Simplification
4D-Fly: Fast 4D Reconstruction from a Single Monocular Video
Towards Precise Scaling Laws for Video Diffusion Transformers
Sampling Innovation-Based Adaptive Compressive Sensing
Argus: Vision-Centric Reasoning with Grounded Chain-of-Thought
Floating No More: Object-Ground Reconstruction from a Single Image
InteractionMap: Improving Online Vectorized HDMap Construction with Interaction
Task-aware Cross-modal Feature Refinement Transformer with Large Language Models for Visual Grounding
One for More: Continual Diffusion Model for Anomaly Detection
Q-DiT: Accurate Post-Training Quantization for Diffusion Transformers
Interpretable Image Classification via Non-parametric Part Prototype Learning
DeSiRe-GS: 4D Street Gaussians for Static-Dynamic Decomposition and Surface Reconstruction for Urban Driving Scenes
Zero-Shot Monocular Scene Flow Estimation in the Wild
FrugalNeRF: Fast Convergence for Extreme Few-shot Novel View Synthesis without Learned Priors
AuraFusion360: Augmented Unseen Region Alignment for Reference-based 360° Unbounded Scene Inpainting
AirRoom: Objects Matter in Room Reidentification
Generative Photomontage
Divot: Diffusion Powers Video Tokenizer for Comprehension and Generation
Dynamic Camera Poses and Where to Find Them
Neural Motion Simulator Pushing the Limit of World Models in Reinforcement Learning
ReasonGrounder: LVLM-Guided Hierarchical Feature Splatting for Open-Vocabulary 3D Visual Grounding and Reasoning
Matrix-Free Shared Intrinsics Bundle Adjustment
MedUnifier: Unifying Vision-and-Language Pre-training on Medical Data with Vision Generation Task using Discrete Visual Representations
ZeroVO: Visual Odometry with Minimal Assumptions
Sketch Down the FLOPs: Towards Efficient Networks for Human Sketch
CrossOver: 3D Scene Cross-Modal Alignment
MAD: Memory-Augmented Detection of 3D Objects
Scaling Inference Time Compute for Diffusion Models
Improving Personalized Search with Regularized Low-Rank Parameter Updates
RASP: Revisiting 3D Anamorphic Art for Shadow-Guided Packing of Irregular Objects
Early-Bird Diffusion: Investigating and Leveraging Timestep-Aware Early-Bird Tickets in Diffusion Models for Efficient Training
Detecting Adversarial Data Using Perturbation Forgery
FreqDebias: Towards Generalizable Deepfake Detection via Consistency-Driven Frequency Debiasing
Spatial-Temporal Visual Representation for Self-Supervised Motion Planning
DELT: A Simple Diversity-driven EarlyLate Training for Dataset Distillation
DexDiffuser: Interaction-aware Diffusion Planning for Adaptive Dexterous Manipulation
Common3D: Self-Supervised Learning of 3D Morphable Models for Common Objects in Neural Feature Space
Interpreting Object-level Foundation Models via Visual Precision Search
MirrorVerse: Pushing Diffusion Models to Realistically Reflect the World
Image is All You Need to Empower Large-scale Diffusion Models for In-Domain Generation
Mamba-Reg: Vision Mamba Also Needs Registers
Learning Partonomic 3D Reconstruction from Image Collections
AdaCM$^2$: On Understanding Extremely Long-Term Video with Adaptive Cross-Modality Memory Reduction
RoboSpatial: Teaching Spatial Understanding to 2D and 3D Vision-Language Models for Robotics
Eval3D: Interpretable and Fine-grained Evaluation for 3D Generation
What Makes a Good Dataset for Knowledge Distillation?
DH-Set: Improving Vision-Language Alignment with Diverse and Hybrid Set-Embeddings Learning
BadToken: Token-level Backdoor Attacks to Multi-modal Large Language Models
Learning from Synchronization: Self-Supervised Uncalibrated Multi-View Person Association in Challenging Scenes
Enhancing 3D Gaze Estimation in the Wild using Weak Supervision with Gaze Following Labels
Nullu: Mitigating Object Hallucinations in Large Vision-Language Models via HalluSpace Projection
MERGE: Multi-faceted Hierarchical Graph-based GNN for Gene Expression Prediction from Whole Slide Histopathology Images
Spherical Manifold Guided Diffusion Model for Panoramic Image Generation
HyperSeg: Hybrid Segmentation Assistant with Fine-grained Visual Perceiver
SketchFusion: Learning Universal Sketch Features through Fusing Foundation Models
ChatHuman: Chatting about 3D Humans with Tools
3D Gaussian Inpainting with Depth-Guided Cross-View Consistency
ChatGarment: Garment Estimation, Generation and Editing via Large Language Models
CSC-PA: Cross-image Semantic Correlation via Prototype Attentions for Single-network Semi-supervised Breast Tumor Segmentation
Exploring Historical Information for RGBE Visual Tracking with Mamba
OpenSDI: Spotting Diffusion-Generated Images in the Open World
Making Old Film Great Again: Degradation-aware State Space Model for Old Film Restoration
Boosting Point-Supervised Temporal Action Localization through Integrating Query Reformation and Optimal Transport
Identity-Clothing Similarity Modeling for Unsupervised Clothing Change Person Re-Identification
HUSH: Holistic Panoramic 3D Scene Understanding using Spherical Harmonics
SfM-Free 3D Gaussian Splatting via Hierarchical Training
Hierarchical Features Matter: A Deep Exploration of Progressive Parameterization Method for Dataset Distillation
GENIUS: A Generative Framework for Universal Multimodal Search
Taste More, Taste Better: Diverse Data and Strong Model Boost Semi-Supervised Crowd Counting
Progress-Aware Video Frame Captioning
Let's Chorus: Partner-aware Hybrid Song-Driven 3D Head Animation
RNG: Relightable Neural Gaussians
Associative Transformer
SGC-Net: Stratified Granular Comparison Network for Open-Vocabulary HOI Detection
Distilling Spectral Graph for Object-Context Aware Open-Vocabulary Semantic Segmentation
Event Ellipsometer: Event-based Mueller-Matrix Video Imaging
Modeling Thousands of Human Annotators for Generalizable Text-to-Image Person Re-identification
Pixel-aligned RGB-NIR Stereo Imaging and Dataset for Robot Vision
Driving by the Rules: A Benchmark for Integrating Traffic Sign Regulations into Vectorized HD Map
Noise Modeling in One Hour: Minimizing Preparation Efforts for Self-supervised Low-Light RAW Image Denoising
Any3DIS: Class-Agnostic 3D Instance Segmentation by 2D Mask Tracking
UrbanCAD: Towards Highly Controllable and Photorealistic 3D Vehicles for Urban Scene Simulation
Towards Generalizable Scene Change Detection
Differentiable Inverse Rendering with Interpretable Basis BRDFs
Large Inverse Rendering Model for Reconstruction of Shape, Materials and Realistic Radiance Field
CheXwhatsApp: A Dataset for Exploring Challenges in the Diagnosis of Chest X-rays through Mobile Devices
DifIISR: Diffusion Model with Gradient Guidance for Infrared Image Super-Resolution
Building Vision Models upon Heat Conduction
5%>100%: Breaking Performance Shackles of Full Fine-Tuning on Visual Recognition Tasks
Stabilizing and Accelerating Autofocus with Expert Trajectory Regularized Deep Reinforcement Learning
Unlocking the potential of unlabeled data in semi-supervised domain generalization
EdgeTAM: On-Device Track Anything Model
LeanGaussian: Breaking Pixel or Point Cloud Correspondence in Modeling 3D Gaussians
pFedMixF: Personalized Federated Class-Incremental Learning with Mixture of Frequency Aggregation
Doppelgangers++: Improved Visual Disambiguation with Geometric 3D Features
Material Anything: Generating Materials for Any 3D Object via Diffusion
PulseCheck457: A Diagnostic Benchmark for Comprehensive Spatial Reasoning of Large Multimodal Models
Less is More: Efficient Image Vectorization with Adaptive Parameterization
RaSS: Improving Denoising Diffusion Samplers with Reinforced Active Sampling Scheduler
MMTL-UniAD: A Unified Framework for Multimodal and Multi-Task Learning in Assistive Driving Perception
SGCR: Spherical Gaussians for Efficient 3D Curve Reconstruction
R-TPT: Improving Adversarial Robustness of Vision-Language Models through Test-Time Prompt Tuning
KAC: Kolmogorov-Arnold Classifier for Continual Learning
SAM-I2V: Upgrading SAM to Support Promptable Video Segmentation with Less than 0.2% Training Cost
Steepest Descent Density Control for Compact 3D Gaussian Splatting
GFlowVLM: Enhancing Multi-step Reasoning in Vision-Language Models with Generative Flow Networks
Depth Any Camera: Zero-Shot Metric Depth Estimation from Any Camera
Generative Video Propagation
Scaling Properties of Diffusion Models For Perceptual Tasks
Be More Specific: Evaluating Object-centric Realism in Synthetic Images
Olympus: A Universal Task Router for Computer Vision Tasks
ChatGen: Automatic Text-to-Image Generation From FreeStyle Chatting
FaithDiff: Unleashing Diffusion Priors for Faithful Image Super-resolution
MetaShadow: Object-Centered Shadow Detection, Removal, and Synthesis
Movie Weaver: Tuning-Free Multi-Concept Video Personalization with Anchored Prompts
Navigation World Models
PartGen: Part-level 3D Generation and Reconstruction with Multi-view Diffusion Models
HotSpot: Screened Poisson Equation for Signed Distance Function Optimization
M-LLM Based Video Frame Selection for Efficient Video Understanding
Learning Audio-guided Video Representation with Gated Attention for Video-Text Retrieval
Accelerating Multimodal Large Language Models by Searching Optimal Vision Token Reduction
Generative Omnimatte: Learning to Decompose Video into Layers
SeedVR: Seeding Infinity in Diffusion Transformer Towards Generic Video Restoration
Visual Agentic AI for Spatial Reasoning with a Dynamic API
AvatarArtist: Open-Domain 4D Avatarization
Seq2Time: Sequential Knowledge Transfer for Video LLM Temporal Grounding
CASAGPT: Cuboid Arrangement and Scene Assembly for Interior Design
OVBench: How Far is Your Video-LLMs from Real-World Online Video Understanding?
Radio Frequency Ray Tracing with Neural Object Representation for Enhanced RF Modeling
A Dataset for Semantic Segmentation in the Presence of Unknowns
Empowering LLMs to Understand and Generate Complex Vector Graphics
Patient-Level Anatomy Meets Scanning-Level Physics: Personalized Federated Low-Dose CT Denoising Empowered by Large Language Model
SpatialDreamer: Self-supervised Stereo Video Synthesis from Monocular Input
InsTaG: Learning Personalized 3D Talking Head from Few-Second Video
CorrBEV: Multi-View 3D Object Detection by Correlation Learning with Multi-modal Prototypes
On Denoising Walking Videos for Gait Recognition
UniMamba: Unified Spatial-Channel Representation Learning with Group-Efficient Mamba for LiDAR-based 3D Object Detection
ResCLIP: Residual Attention for Training-free Dense Vision-language Inference
DRAWER: Digital Reconstruction and Articulation With Environment Realism
Remote Photoplethysmography in Real-World and Extreme Lighting Scenarios
Minding Fuzzy Regions: A Data-driven Alternating Learning Paradigm for Stable Lesion Segmentation
DreamTrack: Dreaming the Future for Multimodal Visual Object Tracking
LoRA Subtraction for Drift-Resistant Space in Exemplar-Free Continual Learning
ALIEN: Implicit Neural Representations for Human Motion Prediction under Arbitrary Latency
Task-driven Image Fusion with Learnable Fusion Loss
Structure-Aware Correspondence Learning for Relative Pose Estimation
HiLoTs: High-Low Temporal Sensitive Representation Learning for Semi-Supervised LiDAR Segmentation in Autonomous Driving
TransPixar: Advancing Text-to-Video Generation with Transparency
Enhanced then Progressive Fusion with View Graph for Multi-View Clustering
IterIS: Iterative Inference-Solving Alignment for LoRA Merging
A Physics-Informed Blur Learning Framework for Imaging Systems
BIGS: Bimanual Category-agnostic Interaction Reconstruction from Monocular Videos via 3D Gaussian Splatting
Free-viewpoint Human Animation with Pose-correlated Reference Selection
SET: Spectral Enhancement for Tiny Object Detection
ModeSeq: Taming Sparse Multimodal Motion Prediction with Sequential Mode Modeling
Recover and Match: Open-Vocabulary Multi-Label Recognition through Knowledge-Constrained Optimal Transport
Sharp-It: A Multi-view to Multi-view Diffusion Model for 3D Synthesis and Manipulation
PhysicsGen: Can Generative Models Learn from Images to Predict Complex Physical Relations?
Mono-InternVL: Pushing the Boundaries of Monolithic Multimodal Large Language Models with Endogenous Visual Pre-training
HybridGS: Decoupling Transients and Statics with 2D and 3D Gaussian Splatting
V2X-R: Cooperative LiDAR-4D Radar Fusion with Denoising Diffusion for 3D Object Detection
EVolSplat: Efficient Volume-based Gaussian Splatting for Urban View Synthesis
POT: Prototypical Optimal Transport for Weakly Supervised Semantic Segmentation
Marten: Visual Question Answering with Mask Generation for Multi-modal Document Understanding
DepthCrafter: Generating Consistent Long Depth Sequences for Open-world Videos
Query Efficient Black-Box Visual Prompting with Subspace Learning
Robust Message Embedding via Attention Flow-Based Steganography
NexusGS: Sparse View Synthesis with Epipolar Depth Priors in 3D Gaussian Splatting
A Focused Human Body Model for Accurate Anthropometric Measurements Extraction
HOIGen-1M: A Large-scale Dataset for Human-Object Interaction Video Generation
DVHGNN: Multi-Scale Dilated Vision HGNN for Efficient Vision Recognition
Uni-Renderer: Unifying Rendering and Inverse Rendering Via Dual Stream Diffusion
Attention Distillation: A Unified Approach to Visual Characteristics Transfer
LLaVA-ST: A Multimodal Large Language Model for Fine-Grained Spatial-Temporal Understanding
EMOE: Modality-Specific Enhanced Dynamic Emotion Experts
Align3R: Aligned Monocular Depth Estimation for Dynamic Videos
Uncertain Multimodal Intention and Emotion Understanding in the Wild
Multi-label Prototype Visual Spatial Search for Weakly Supervised Semantic Segmentation
DeformCL: Learning Deformable Centerline Representation for Vessel Extraction in 3D Medical Image
Bundle Sampling: Revisiting Plenoptic Sampling Theory for Efficient Generalizable Neural Radiance Field
Nearly Zero-Cost Protection Against Mimicry by Personalized Diffusion Models
GIVEPose: Gradual Intra-class Variation Elimination for RGB-based Category-Level Object Pose Estimation
Harnessing Global-local Collaborative Adversarial Perturbation for Anti-Customization
Disco4D: Disentangled 4D Human Generation and Animation from a Single Image
Quantization without Tears
SinGS: Animatable Single-Image Human Gaussian Splats with Kinematic Priors
MovieBench: A Hierarchical Movie Level Dataset for Long Video Generation
Locality-Aware Zero-Shot Human-Object Interaction Detection
PoseTraj: Pose-Aware Trajectory Control in Video Diffusion
GoLF-NRT: Integrating Global Context and Local Geometry for Few-Shot View Synthesis
OccMamba: Semantic Occupancy Prediction with State Space Models
ATP-LLaVA: Adaptive Token Pruning for Large Vision Language Models
Efficient Transfer Learning for Video-language Foundation Models
Revisiting Audio-Visual Segmentation with Vision-Centric Transformer
Stable-SCore: A Stable Registration-based Framework for 3D Shape Correspondence
Language-Guided Audio-Visual Learning for Long-Term Sports Assessment
VoCo-LLaMA: Towards Vision Compression with Large Language Models
End-to-End HOI Reconstruction Transformer with Graph-based Encoding
Learning a Visual Lexicon from Diffusion Models
Inst3D-LMM: Instance-Aware 3D Scene Understanding with Multi-modal Instruction Tuning
Flowing from Words to Pixels: A Framework for Cross-Modality Evolution
Mamba4D: Efficient 4D Point Cloud Video Understanding with Disentangled Spatial-Temporal State Space Models
Gain from Neighbors: Boosting Model Robustness in the Wild via Adversarial Perturbations Toward Neighboring Classes
No Pains, More Gains: Recycling Sub-Salient Patches for Efficient High-Resolution Image Recognition
VisionPAD: A Vision-Centric Pre-training Paradigm for Autonomous Driving
Optimizing for the Shortest Path in Denoising Diffusion Model
Rethinking Personalized Aesthetics Assessment: Employing Physique Aesthetics Assessment as An Exemplification
Video Language Model Pretraining with Spatio-temporal Masking
MetaWriter: Personalized Handwritten Text Recognition Using Meta-Learned Prompt Tuning
ProAPO: Progressively Automatic Prompt Optimization for Visual Classification
RUBIK: A Structured Benchmark for Image Matching across Geometric Challenges
ArtFormer: Controllable Generation of Diverse 3D Articulated Objects
Breaking the Memory Barrier of Contrastive Loss via Tile-Based Strategy
AutoLUT: LUT-Based Image Super-Resolution with Automatic Sampling and Adaptive Residual Learning
Learning Compatible Multi-Prize Subnetworks for Asymmetric Retrieval
S2D-LFE: Sparse-to-Dense Light Field Event Generation
Conditional Balance: Improving Multi-Conditioning Trade-Offs in Image Generation
GAF: Gaussian Avatar Reconstruction from Monocular Videos via Multi-view Diffusion
Rethinking Correspondence-based Category-Level Object Pose Estimation
Pseudo Visible Feature Fine-Grained Fusion for Thermal Object Detection
ArcPro: Architectural Programs for Structured 3D Abstraction of Sparse Points
Your Large Vision-Language Model Only Needs A Few Attention Heads For Visual Grounding
Hiding Images in Diffusion Models by Editing Learned Score Functions
SwiftEdit: Lightning Fast Text-Guided Image Editing via One-Step Diffusion
EfficientViM: Efficient Vision Mamba with Hidden State Mixer based State Space Duality
Jailbreaking the Non-Transferable Barrier via Test-Time Data Disguising
Evolving High-Quality Rendering and Reconstruction in a Unified Framework with Contribution-Adaptive Regularization
DocVLM: Make Your VLM an Efficient Reader
Scene-agnostic Pose Regression for Visual Localization
Rethinking Epistemic and Aleatoric Uncertainty for Active Open-Set Annotation: An Energy-Based Approach
MAR-3D: Progressive Masked Auto-regressor for High-Resolution 3D Generation
VideoChat-Online: Towards Online Spatial-Temporal Video Understanding via Large Video Language Models
JarvisIR: Elevating Autonomous Driving Perception with Intelligent Image Restoration
Stable Flow: Vital Layers for Training-Free Image Editing
Omni-scale Context Modeling with State Space Models and Local Attention for Semantic Segmentation
Alias-free Latent Diffusion Models: Improving Fractional Shift Equivariance of Diffusion Latent Space
Buffer Anytime: Zero-Shot Video Depth and Normal from Single-View Priors
Lifting Motion to the 3D World via 2D Diffusion
Beyond Human Perception: Understanding Multi-Object World from Monocular View
OmniFlow: Any-to-Any Generation with Multi-Modal Rectified Flows
GeoDepth: From Point-to-Depth to Plane-to-Depth Modeling for Self-Supervised Monocular Depth Estimation
DeNVeR: Deformable Neural Vessel Representations for Unsupervised Video Vessel Segmentation
Knowledge Bridger: Towards Training-free Missing Multi-modality Completion
A Comprehensive Study of Decoder-Only LLMs for Text-to-Image Generation
DashGaussian: Optimizing 3D Gaussian Splatting in 200 Seconds
FADE: Frequency-Aware Diffusion Model Factorization for Video Editing
Linguistics-aware Masked Image Modeling for Self-supervised Scene Text Recognition
Advancing Myopia To Holism: Fully Contrastive Language–Image Pre-training
On the Out-Of-Distribution Generalization of Large Multimodal Models
COUNTS: Benchmarking Object Detectors and Multimodal Large Language Models under Distribution Shifts
CADRef: Robust Out-of-Distribution Detection via Class-Aware Decoupled Relative Feature Leveraging
MegaSaM: Accurate, Fast and Robust Structure and Motion from Casual Dynamic Videos
Learning from Streaming Video with Orthogonal Gradients
Event fields: Capturing light fields at high speed, resolution, and dynamic range
LUMINET: Image-based Indoor Scene Relighting via Latent Intrinsics
3D Prior Is All You Need: Cross-Task Few-shot 2D Gaze Estimation
GCE-Pose: Global Context Enhancement for Category-level Object Pose Estimation
LibraGrad: Balancing Gradient Flow for Universally Better Vision Transformer Attributions
CoA: Towards Real Image Dehazing via Compression-and-Adaptation
Hash3D: Training-free Acceleration for 3D Generation
Divide and Conquer: Heterogeneous Noise Integration for Diffusion-based Adversarial Purification
Pixel-level and Semantic-level Adjustable Super-resolution: A Dual-LoRA Approach
DreamOmni: Unified Image Generation and Editing
Handling Spatial-Temporal Data Heterogeneity for Federated Continual Learning via Tail Anchor
RaCFormer: Towards High-Quality 3D Object Detection via Query-based Radar-Camera Fusion
RORem: Training a Robust Object Remover with Human-in-the-Loop
Star with Bilinear Mapping
Fingerprinting Denoising Diffusion Probabilistic Models
MagicArticulate: Make Your 3D Models Articulation-Ready
DiffSensei: Bridging Multi-Modal LLMs and Diffusion Models for Customized Manga Generation
NightAdapter: Learning a Frequency Adapter for Generalizable Night-time Scene Segmentation
Sufficient Invariant Learning for Distribution Shift
BOLT: Boost Large Vision-Language Model Without Training for Long-form Video Understanding
AIpparel: A Large Multimodal Generative Model for Digital Garments
R-SCoRe: Revisiting Scene Coordinate Regression for Robust Large-Scale Visual Localization
Dense-SfM: Structure from Motion with Dense Consistent Matching
Higher-Order Ratio Cycles for Fast and Globally Optimal Shape Matching
A Simple Data Augmentation for Feature Distribution Skewed Federated Learning
RestorGS: Depth-aware Gaussian Splatting for Efficient 3D Scene Restoration
FOCUS: Knowledge-enhanced Adaptive Visual Compression for Few-shot Whole Slide Image Classification
MV-DUSt3R+: Single-Stage Scene Reconstruction from Sparse Views In 2 Seconds
CATANet: Efficient Content-Aware Token Aggregation for Lightweight Image Super-Resolution
Ref-GS: Modeling View-Dependent Appearance with Environment Gaussian
Open-Vocabulary Functional 3D Scene Graphs for Real-World Indoor Spaces
Move-in-2D: 2D-Conditioned Human Motion Generation
Dual-Granularity Semantic Guided Sparse Routing Diffusion Model for General Pansharpening
FlashGS: Efficient 3D Gaussian Splatting for Large-scale and High-resolution Rendering
Full-DoF Egomotion Estimation for Event Cameras Using Geometric Solvers
CAP-Net: A Unified Network for 6D Pose and Size Estimation of Categorical Articulated Parts from a Single RGB-D Image
Anchor-Aware Similarity Cohesion in Target Frames Enables Predicting Temporal Moment Boundaries in 2D
From Prototypes to General Distributions: An Efficient Curriculum for Masked Image Modeling
SceneCrafter: Controllable Multi-View Driving Scene Editing
Active Hyperspectral Imaging Using an Event Camera
Learning with Dynamic Motion Blending for Versatile Motion Editing
Vid2Sim: Realistic and Interactive Simulation from Video for Urban Navigation
Fast3R: 3D Reconstruction of 1000+ Images in a Single Pass
InterMimic: Towards Learning Universal Human-Object Interaction Skills from Imperfect Motion Capture
UniReal: Universal Image Generation and Editing via Learning Real-world Dynamics
MultiVENT 2.0: A Massive Multilingual Benchmark for Event-Centric Video Retrieval
Self-Evolving Visual Concept Library using Vision-Language Critics
DTOS: Dynamic Time Object Sensing with Multimodal Large Language Model
Neural Inverse Rendering from Propagating Light
Morpheus: Text-Driven 3D Gaussian Splat Shape and Color Stylization
A Unified Image-Dense Annotation Generation Model for Underwater Scenes
RAEncoder: A Label-Free Reversible Adversarial Examples Encoder for Dataset Intellectual Property Protection
AnyMoLe: Any Character Motion In-betweening Leveraging Video Diffusion Models
VideoRefer Suite: Advancing Spatial-Temporal Object Understanding with Video LLM
SPC-GS: Gaussian Splatting with Semantic-Prompt Consistency for Indoor Open-World Free-view Synthesis from Sparse Inputs
Sparse Voxels Rasterization: Real-time High-fidelity Radiance Field Rendering
Segment Any-Quality Images with Generative Latent Space Enhancement
AnyCam: Learning to Recover Camera Poses and Intrinsics from Casual Videos
Segment Anything, Even Occluded
ECBench: Can Multi-modal Foundation Models Understand the Egocentric World? A Holistic Embodied Cognition Benchmark
LineArt: A Knowledge-guided Training-free High-quality Appearance Transfer for Design Drawing with Diffusion Model
FIMA-Q: Post-Training Quantization for Vision Transformers by Fisher Information Matrix Approximation
OSDFace: One-Step Diffusion Model for Face Restoration
Project-Probe-Aggregate: Efficient Fine-Tuning for Group Robustness
AniGS: Animatable Gaussian Avatar from a Single Image with Inconsistent Gaussian Reconstruction
Adaptive Dropout: Unleashing Dropout across Layers for Generalizable Image Super-Resolution
StreetCrafter: Street View Synthesis with Controllable Video Diffusion Models
Efficient Test-time Adaptive Object Detection via Sensitivity-Guided Pruning
MobileMamba: Lightweight Multi-Receptive Visual Mamba Network
AR-Diffusion: Asynchronous Video Generation with Auto-Regressive Diffusion
Channel Consistency Prior and Self-Reconstruction Strategy Based Unsupervised Image Deraining
Glossy Object Reconstruction with Cost-effective Polarized Acquisition
VODiff: Controlling Object Visibility Order in Text-to-Image Generation
Bridging Gait Recognition and Large Language Models Sequence Modeling
InsightEdit: Towards Better Instruction Following for Image Editing
SOLAMI: Social Vision-Language-Action Modeling for Immersive Interaction with 3D Autonomous Characters
Universal Scene Graph Generation
Sample- and Parameter-Efficient Auto-Regressive Image Models
Point-Cache: Test-time Dynamic and Hierarchical Cache for Robust and Generalizable Point Cloud Analysis
TAET: Two-Stage Adversarial Equalization Training on Long-Tailed Distributions
Link-based Contrastive Learning for One-Shot Unsupervised Domain Adaptation
HeMoRa: Unsupervised Heuristic Consensus Sampling for Robust Point Cloud Registration
Roger: Advancing Video Generation of Task-Oriented Hand-Object Interaction for Generalizable Robotic Manipulation
TurboFill: Adapting Few-step Text-to-image Model for Fast Image Inpainting
ICP: Immediate Compensation Pruning for Mid-to-high Sparsity
Preconditioners for the Stochastic Training of Neural Fields
MVBoost: Boost 3D Reconstruction with Multi-View Refinement
Learning to Normalize on the SPD Manifold under Bures-Wasserstein geometry
URWKV: Unified RWKV Model with Multi-state Perspective for Low-light Image Restoration
Bias for Action: Video Implicit Neural Representations with Bias Modulation
Bayesian Test-Time Adaptation for Vision-Language Models
Can Generative Video Models Help Pose Estimation?
iSegMan: Interactive Segment-and-Manipulate 3D Gaussians
StereoAnything: Zero-Shot Stereo Matching
Two by Two: Learning Cross-Task Pairwise Objects Assembly for Generalizable Robot Manipulation
EntitySAM: Segment Everything in Video
GEN3C: 3D-Informed World-Consistent Video Generation with Precise Camera Control
Instruct-CLIP: Improving Instruction-Guided Image Editing with Automated Data Refinement Using Contrastive Learning
Probing the Mid-level Vision Capabilities of Self-Supervised Learning
Explain in Diffusion: Explaining a Classifier with Diffusion Semantics
ProtoDepth: Unsupervised Continual Depth Completion with Prototypes
GPVK-VL: Geometry-Preserving Virtual Keyframes for Visual Localization under Large Viewpoint Changes
Towards Zero-Shot Anomaly Detection and Reasoning with Multimodal Large Language Models
MoEE: Mixture of Emotion Experts for Audio-Driven Portrait Animation
Detail-Preserving Latent Diffusion for Stable Shadow Removal
GEM: A Generalizable Ego-vision Multimodal World Model for Fine-Grained Ego-Motion, Object Dynamics, and Scene Composition Control
PARC: A Quantitative Framework Uncovering the Symmetries within Vision Language Models
VASparse: Towards Efficient Visual Hallucination Mitigation via Visual-Aware Token Sparsification
SUM Parts: Benchmarking Part-Level Semantic Segmentation of Urban Meshes
VIRES: Video Instance Repainting with Sketch and Text Guidance
GigaHands: A Massive Annotated Dataset of Bimanual Hand Activities
Parametric Point Cloud Completion for Polygonal Surface Reconstruction
Test-Time Backdoor Detection for Object Detection Models
Satellite Observations-guided Diffusion Model for Accurate Meteorological States at Arbitrary Resolution
MoVE-KD: Knowledge Distillation for VLMs with Mixture of Visual Encoders
EVOS: Efficient Implicit Neural Training via EVOlutionary Selector
CoE: Chain-of-Explanation via Automatic Visual Concept Circuit Description and Polysemanticity Quantification
MambaIRv2: Attentive State Space Restoration
Dispider: Enabling Video LLMs with Active Real-Time Interaction via Disentangled Perception, Decision, and Reaction
Do Your Best and Get Enough Rest for Continual Learning
Taming Large Multimodal Agents for Ultra-low Bitrate Semantically Disentangled Image Compression
Dual-Interrelated Diffusion Model for Few-Shot Anomaly Image Generation
PhyS-EdiT: Physics-aware Semantic Image Editing with Text Description
FreeSim: Toward Free-viewpoint Camera Simulation in Driving Scenes
Uncertainty-Instructed Structure Injection for Generalizable HD Map Construction
Segment Any Motion in Videos
Progressive Correspondence Regenerator for Robust 3D Registration
Erase Diffusion: Empowering Object Removal Through Calibrating Diffusion Pathways
VTON-HandFit: Virtual Try-on for Arbitrary Hand Pose Guided by Hand Priors Embedding
Breaking the Low-Rank Dilemma of Linear Attention
UniVAD: A Training-free Unified Model for Few-shot Visual Anomaly Detection
HiPART: Hierarchical Pose AutoRegressive Transformer for Occluded 3D Human Pose Estimation
Mosaic3D: Foundation Dataset and Model for Open-Vocabulary 3D Segmentation
Tripartite Weight-Space Ensemble for Few-Shot Class-Incremental Learning
Learning Hazing to Dehazing: Towards Realistic Haze Generation for Real-World Image Dehazing
Change3D: Revisiting Change Detection and Captioning from A Video Modeling Perspective
WAVE: Weight Templates for Adaptive Initialization of Variable-sized Models
Redefining <Creative> in Dictionary: Towards an Enhanced Semantic Understanding of Creative Generation
Learning to Detect Objects from Multi-Agent LiDAR Scans without Manual Labels
VideoAutoArena: An Automated Arena for Evaluating Large Multimodal Models in Video Analysis through User Simulation
CamPoint: Boosting Point Cloud Segmentation with Virtual Camera
UniGoal: Towards Universal Zero-shot Goal-oriented Navigation
Improving Autoregressive Visual Generation with Cluster-Oriented Token Prediction
MNE-SLAM: Multi-Agent Neural SLAM for Mobile Robots
Symbolic Representation for Any-to-Any Generative Tasks
Long Video Diffusion Generation with Segmented Cross-Attention and Content-Rich Video Data Curation
3D-LLaVA: Towards Generalist 3D LMMs with Omni Superpoint Transformer
Lux Post Facto: Learning Portrait Performance Relighting with Conditional Video Diffusion and a Hybrid Dataset
Easy-editable Image Vectorization with Multi-layer Multi-scale Distributed Visual Feature Embedding
RGBAvatar: Reduced Gaussian Blendshapes for Head Avatar Animation
Reconstructing Close Human Interaction with Appearance and Proxemics Reasoning
OverLoCK: An Overview-first-Look-Closely-next ConvNet with Context-Mixing Dynamic Kernels
Vid2Sim: Generalizable, Video-based Reconstruction of Appearance, Geometry and Physics for Mesh-free Simulation
TSP-Mamba: The Travelling Salesman Problem Meets Mamba for Image Super-resolution and Beyond
Make-It-Animatable: An Efficient Framework for Authoring Animation-Ready 3D Characters
Interactive Affordance Learning for Articulated Objects in 3D Environments
TimeTracker: Event-based Continuous Point Tracking for Video Frame Interpolation with Non-linear Motion
MOS: Modeling Object-Scene Associations in Generalized Category Discovery
Interactive Medical Image Segmentation: A Benchmark Dataset and Baseline
Multiple Object Tracking as ID Prediction
IndoorGS: Geometric Cues Guided Gaussian Splatting for Indoor Scene Reconstruction
Bridge Frame and Event: Common Spatiotemporal Fusion for High-Dynamic Scene Optical Flow
AI-Face: A Million-Scale Demographically Annotated AI-Generated Face Dataset and Fairness Benchmark
FoundHand: Large-Scale Domain-Specific Learning for Controllable Hand Image Generation
DiET-GS: Diffusion Prior and Event Stream-Assisted Motion Deblurring 3D Gaussian Splatting
Mitigating the Human-Robot Domain Discrepancy in Visual Pre-training for Robotic Manipulation
Yo’Chameleon: Personalized Vision and Language Generation
DEIM: DETR with Improved Matching for Fast Convergence
Integral Fast Fourier Color Constancy
MegaSynth: Scaling Up 3D Scene Reconstruction with Synthesized Data
Noise-Resistant Video Anomaly Detection via RGB Error-Guided Multiscale Predictive Coding and Dynamic Memory
UVGS: Reimagining Unstructured 3D Gaussian Splatting using UV Mapping
Articulated Motion Distillation from Video Diffusion Models
Vision-Language Models Do Not Understand Negation
Geometry Field Splatting with Gaussian Surfels
RADIO Amplified: Improved Baselines for Agglomerative Vision Foundation Models
Distilling Long-tailed Datasets
g3D-LF: Generalizable 3D-Language Feature Fields for Embodied Tasks
Filter Images First, Generate Instructions Later: Pre-Instruction Data Selection for Visual Instruction Tuning
Continuous, Subject-Specific Attribute Control in T2I Models by Identifying Semantic Directions
Hearing Hands: Generating Sounds from Physical Interactions in 3D Scenes
Mamba as a Bridge: Where VFM Meets VLM for Domain-Generalized Semantic Segmentation
Towards Visual Discrimination and Reasoning of Real-World Physical Dynamics: Physics-Grounded Anomaly Detection
Everything to the Synthetic: Diffusion-driven Test-time Adaptation via Synthetic-Domain Alignment
HELVIPAD: A Real-World Dataset for Omnidirectional Stereo Depth Estimation
Balanced Rate-Distortion Optimization in Learned Image Compression
Discrete to Continuous: Generating Smooth Transition Poses from Sign Language Observations
DyFo: A Training-Free Dynamic Focus Visual Search for Enhancing LMMs in Fine-Grained Visual Understanding
Diffusion Model is Effectively its Own Teacher
TokenHSI: Unified Synthesis of Physical Human-Scene Interactions through Task Tokenization
Navigating Image Restoration with VAR’s Distribution Alignment Prior
CG-IR: Curved Gaussian Splatting for Inverse Rendering
Bridging Past and Future: End-to-End Autonomous Driving with Historical Prediction and Planning
Towards High-fidelity 3D Talking Avatar with Personalized Dynamic Texture
VISTA3D: A Unified Segmentation Foundation Model For 3D Medical Imaging
DiC: Rethinking Conv3x3 Designs in Diffusion Models
CoSER: Towards Consistent Dense Multiview Text-to-Image Generator for 3D Creation
CoMM: A Coherent Interleaved Image-Text Dataset for Multimodal Understanding and Generation
3D Convex Splatting: Radiance Field Rendering with 3D Smooth Convexes
FlipSketch: Flipping Static Drawings to Text-Guided Sketch Animations
S$^3$-Face: SSS-Compliant Facial Reflectance Estimation via Diffusion Priors
LOGICZSL: Exploring Logic-induced Representation for Compositional Zero-shot Learning
Temporal Score Analysis for Understanding and Correcting Diffusion Artifacts
Homogeneous Dynamics Space for Heterogeneous Humans
T2SG: Traffic Topology Scene Graph for Topology Reasoning in Autonomous Driving
Boosting Adversarial Transferability through Augmentation in Hypothesis Space
SEEN-DA: SEmantic ENtropy guided Domain-aware Attention for Domain Adaptive Object Detection
Test-Time Visual In-Context Tuning
Benchmarking Large Vision-Language Models via Directed Scene Graph for Comprehensive Image Captioning
Horizon-GS: Unified 3D Gaussian Splatting for Large-Scale Aerial-to-Ground Scenes
An End-to-End Robust Point Cloud Semantic Segmentation Network with Single-Step Conditional Diffusion Models
DexGrasp Anything: Towards Universal Robotic Dexterous Grasping with Physics Awareness
IRGS: Inter-Reflective Gaussian Splatting with 2D Gaussian Ray Tracing
TensoFlow: Tensorial Flow-based Sampler for Inverse Rendering
MotionPRO: Exploring the Role of Pressure in Human MoCap and Beyond
Identity-Preserving Text-to-Video Generation by Frequency Decomposition
MBQ: Modality-Balanced Quantization for Large Vision-Language Models
Silence is Golden: Leveraging Adversarial Examples to Nullify Audio Control in LDM-based Talking-Head Generation
Gaussian Splatting Feature Fields for (Privacy-Preserving) Visual Localization
IM-Zero: Instance-level Motion Controllable Video Generation in a Zero-shot Manner
Gromov–Wasserstein Problem with Cyclic Symmetry
FlowRAM: Grounding Flow Matching Policy with Region-Aware Mamba Framework for Robotic Manipulation
Dual Focus-Attention Transformer for Robust Point Cloud Registration
FedSPA: Generalizable Federated Graph Learning under Homophily Heterogeneity
WISNet: Pseudo Label Generation on Unbalanced and Patch Annotated Waste Images
LightLoc: Learning Outdoor LiDAR Localization at Light Speed
TexGarment: Consistent Garment UV Texture Generation via Efficient 3D Structure-Guided Diffusion Transformer
Geometric Knowledge-Guided Localized Global Distribution Alignment for Federated Learning
Extreme Rotation Estimation in the Wild
Consistent Normal Orientation for 3D Point Clouds via Least Squares on Delaunay Graph
Hybrid-Level Instruction Injection for Video Token Compression in Multi-modal Large Language Models
All-Optical Nonlinear Diffractive Deep Network for Ultrafast Image Denoising
EnergyMoGen: Compositional Human Motion Generation with Energy-Based Diffusion Model in Latent Space
NoPain: No-box Point Cloud Attack via Optimal Transport Singular Boundary
Timestep Embedding Tells: It's Time to Cache for Video Diffusion Model
Learning Dynamic Collaborative Network for Semi-supervised 3D Vessel Segmentation
Re-HOLD: Video Hand Object Interaction Reenactment via adaptive Layout-instructed Diffusion Model
Structured 3D Latents for Scalable and Versatile 3D Generation
LION-FS: Fast & Slow Video-Language Thinker as Online Video Assistant
Rethinking Query-based Transformer for Continual Image Segmentation
Improved Video VAE for Latent Video Diffusion Model
Compositional Multi-Label Universal Perturbations
DyCoke: Dynamic Compression of Tokens for Fast Video Large Language Models
LSceneLLM: Enhancing Large 3D Scene Understanding Using Adaptive Visual Preferences
Acquire and then Adapt: Squeezing out Text-to-Image Model for Image Restoration
Prosody-Enhanced Acoustic Pre-training and Acoustic-Disentangled Prosody Adapting for Movie Dubbing
EditAR: Unified Conditional Generation with Autoregressive Models
RealEdit: Reddit Edits As a Large-scale Empirical Dataset for Image Transformations
COAP: Memory-Efficient Training with Correlation-Aware Gradient Projection
ObjectMover: Generative Object Movement with Video Prior
Zero-shot 3D Question Answering via Voxel-based Dynamic Token Compression
Semantic and Expressive Variations in Image Captions Across Languages
DiSciPLE: Learning Interpretable Programs for Scientific Visual Discovery
BOOTPLACE: Bootstrapped Object Placement with Detection Transformers
The Scene Language: Representing Scenes with Programs, Words, and Embeddings
Taming Multimodal Joint Training for High-Quality Video-to-Audio Synthesis
ID-Patch: Robust ID Association for Group Photo Personalization
Cross-modal Information Flow in Multimodal Large Language Models
Mind the Time: Temporally-Controlled Multi-Event Video Generation
SimVS: Simulating World Inconsistencies for Robust View Synthesis
Image Re-ranking with Long-Context Sequence Modeling
GPS as a Control Signal for Image Generation
Do We Always Need the Simplicity Bias? Looking for Optimal Inductive Biases in the Wild
OSV: One Step is Enough for High-Quality Image to Video Generation
Learning Extremely High Density Crowds as Active Matters
3D Student Splatting and Scooping
Data-free Universal Adversarial Perturbation with Pseudo-semantic Prior
Open-World Amodal Appearance Completion
SmartEraser: Remove Anything from Images using Masked-Region Guidance
Sparse2DGS: Geometry-Prioritized Gaussian Splatting for Surface Reconstruction from Sparse Views
Tiled Diffusion
MUSt3R: Multi-view Network for Stereo 3D Reconstruction
Thinking in Space: How Multimodal Large Language Models See, Remember and Recall Spaces
SketchAgent: Language-Driven Sequential Sketch Generation
HyperFree: A Channel-adaptive and Tuning-free Foundation Model for Hyperspectral Remote Sensing Imagery
InteractAnything: Zero-shot Human Object Interaction Synthesis via LLM Feedback and Object Affordance Parsing
M3amba: Memory Mamba is All You Need for Whole Slide Image Classification
Layered motion fusion: Lifting motion segmentation to 3D in egocentric videos
Zero-1-to-A: Zero-shot One image to Animatable Head Avatars using Video Diffusion
UniScene: Unified Occupancy-centric Driving Scene Generation
Video-ColBERT: Contextualized Late Interaction for Text-to-Video Retrieval
Frequency Dynamic Convolution for Dense Image Prediction
No Thing, Nothing: Highlighting Safety-Critical Classes for Robust LiDAR Semantic Segmentation
uCO3D: UnCommon Objects in 3D
PyTorchGeoNodes: Enabling Differentiable Shape Programs for 3D Shape Reconstruction
Continuous Crowd Behavior Generation
Finding Local Diffusion Schrödinger Bridge using Kolmogorov-Arnold Network
LiSu: A Dataset and Method for LiDAR Surface Normal Estimation
MoGe: Unlocking Accurate Monocular Geometry Estimation for Open-Domain Images with Optimal Training Supervision
RoboSense: Large-scale Dataset and Benchmark for Egocentric Robot Perception and Navigation in Crowded and Unstructured Environments
Noise-Consistent Siamese-Diffusion for Medical Image Synthesis and Segmentation
Beyond Single-Modal Boundary: Cross-Modal Anomaly Detection through Visual Prototype and Harmonization
DiskVPS: Vanishing Point Detector via Hough Transform in a Disk Region
Towards Understanding and Quantifying Uncertainty for Text-to-Image Generation
Image Generation Diversity Issues and How to Tame Them
EvEnhancer: Empowering Effectiveness, Efficiency and Generalizability for Continuous Space-Time Video Super-Resolution with Events
UniHOPE: A Unified Approach for Hand-Only and Hand-Object Pose Estimation
Federated Semi-Supervised Learning via Pseudo-Correction utilizing Confidence Discrepancy
Efficient Visual State Space Model for Image Deblurring
DistinctAD: Distinctive Audio Description Generation in Contexts
Spatial Transport Optimization by Repositioning Attention Map for Training-Free Text-to-Image Synthesis
dFLMoE: Decentralized Federated Learning via Mixture of Experts for Medical Data Analysis
IDProtector: An Adversarial Noise Encoder to Protect Against ID-Preserving Image Generation
DSPNet: Dual-vision Scene Perception for Robust 3D Question Answering
Towards Long-Horizon Vision-Language Navigation: Platform, Benchmark and Method
Cross-modal Causal Relation Alignment for Video Question Grounding
UPME: An Unsupervised Peer Review Framework for Multimodal Large Language Model Evaluation
LoRACLR: Contrastive Adaptation for Customization of Diffusion Models
PolarFree: Polarization-based Reflection-Free Imaging
GaussianIP: Identity-Preserving Realistic 3D Human Generation via Human-Centric Diffusion Prior
DriveGPT4-V2: Harnessing Large Language Model Capabilities for Enhanced Closed-Loop Autonomous Driving
Track4Gen: Teaching Video Diffusion Models to Track Points Improves Video Generation
HiRes-LLaVA: Restoring Fragmentation Input in High-Resolution Large Vision-Language Models
Imagine and Seek: Improving Composed Image Retrieval with an Imagined Proxy
FreePCA: Integrating Consistency Information across Long-short Frames in Training-free Long Video Generation via Principal Component Analysis
OCRT: Boosting Foundation Models in the Open World with Object-Concept-Relation Triad
Multi-Scale Neighborhood Occupancy Masked Autoencoder for Self-Supervised Learning in LiDAR Point Clouds
Seeing the Abstract: Translating the Abstract Language for Vision Language Models
Towards Natural Language-Based Document Image Retrieval: New Dataset and Benchmark
Sonata: Self-Supervised Learning of Reliable Point Representations
Hybrid Explicit Representation for Ultra-Realistic Head Avatars
Omnidirectional Multi-Object Tracking
Q-Eval-100K: Evaluating Visual Quality and Alignment Level for Text-to-Vision Content
Solving Instance Detection from an Open-World Perspective
CLIP is Almost All You Need: Towards Parameter-Efficient Scene Text Retrieval without OCR
Free-form Generation Enhances Challenging Clothed Human Modeling
VideoEspresso: A Large-Scale Chain-of-Thought Dataset for Fine-Grained Video Reasoning via Core Frame Selection
Real-Time Multi-Person 3D Mesh Estimation via Scale-Adaptive Tokens
LiDAR-RT: Gaussian-based Ray Tracing for Dynamic LiDAR Re-simulation
PEACE: Empowering Geologic Map Holistic Understanding with MLLMs
MANTA: A Large-Scale Multi-View and Visual-Text Anomaly Detection Dataset for Tiny Objects
Few-Shot Recognition via Stage-wise Retrieval-Augmented Finetuning
Arbitrary-steps Image Super-resolution via Diffusion Inversion
PIDSR: Complementary Polarized Image Demosaicing and Super-Resolution
DreamText: High Fidelity Scene Text Synthesis
Cross-Modal 3D Representation with Multi-View Images and Point Clouds
Community Forensics: Using Thousands of Generators to Train Fake Image Detectors
Towards Realistic Example-based Modeling via 3D Gaussian Stitching
Enhanced Visual-Semantic Interaction with Tailored Prompts for Pedestrian Attribute Recognition
Video-Guided Foley Sound Generation with Multimodal Controls
MEAT: Multiview Diffusion Model for Human Generation on Megapixels with Mesh Attention
Motions as Queries: One-Stage Multi-Person Holistic Human Motion Capture
Exploring Timeline Control for Facial Motion Generation
ReCapture: Generative Video Camera Controls for User-Provided Videos using Masked Video Fine-Tuning
Egocentric Whole-Body Motion Diffusion with Exemplar-Based Identity Conditioning
TCFG: Tangential Damping Classifier-free Guidance
DoraCycle: Domain-Oriented Adaptation of Unified Generative Model in Multimodal Cycles
SCAP: Transductive Test-Time Adaptation via Supportive Clique-based Attribute Prompting
LUCAS: Layered Universal Codec Avatars
ROICtrl: Boosting Instance Control for Visual Generation
Magma: A Foundation Model for Multimodal AI Agents
PanSplat: 4K Panorama Synthesis with Feed-Forward Gaussian Splatting
DPFlow: Adaptive Optical Flow Estimation with a Dual-Pyramid Framework
MAGE: Single Image to Material-Aware 3D via the Multi-View G-Buffer Estimation Model
Scalable Video-to-Dataset Generation for Cross-Platform Mobile Agents
Robust Multi-Object 4D Generation for In-the-wild Videos
WonderWorld: Interactive 3D Scene Generation from a Single Image
A Unified Model for Compressed Sensing MRI Across Undersampling Patterns
CL-LoRA: Continual Low-Rank Adaptation for Rehearsal-Free Class-Incremental Learning
Reconstructing Humans with a Biomechanically Accurate Skeleton
Show and Segment: Universal Medical Image Segmentation via In-Context Learning
Scaling Vision Pre-Training to 4K Resolution
Category-Agnostic Neural Object Rigging
ViCaS: A Dataset for Combining Holistic and Pixel-level Video Understanding using Captions with Grounded Segmentation
Zero-Shot Novel View and Depth Synthesis with Multi-View Geometric Diffusion
Incremental Object Keypoint Learning
Conical Visual Concentration for Efficient Large Vision-Language Models
ZeroGrasp: Zero-Shot Shape Reconstruction Enabled Robotic Grasping
HOIGPT: Learning Long Sequence Hand-Object Interaction with Language Models
Motion Modes: What Happens Next?
Hand-held Object Reconstruction from RGB Video with Dynamic Interaction
MultiMorph: On-demand Atlas Construction
Decoupled Distillation to Erase: A General Unlearning Method for Any Class-centric Tasks
Unraveling Normal Anatomy via Fluid-Driven Anomaly Randomization
Unlocking Video-LLM via Agent-of-Thoughts Distillation
FG$^2$: Fine-Grained Cross-View Localization by Fine-Grained Feature Matching
AA-CLIP: Enhancing Zero-shot Anomaly Detection via Anomaly-Aware CLIP
MAP: Unleashing Hybrid Mamba-Transformer Vision Backbone's Potential with Masked Autoregressive Pretraining
SimLingo: Vision-only Closed-Loop Autonomous Driving with Grounded Language Understanding
LiVOS: Light Video Object Segmentation with Gated Linear Matching
IMFine: 3D Inpainting via Geometry-guided Multi-view Refinement
AKiRa: Augmentation Kit on Rays for optical video generation
MotionMap: Representing Multimodality in Human Pose Forecasting
Quaffure: Real-Time Quasi-Static Neural Hair Simulation
MatAnyone: Stable Video Matting with Consistent Memory Propagation
The Devil is in the Prompts: Retrieval-Augmented Prompt Optimization for Text-to-Video Generation
Zero-Shot 4D Lidar Panoptic Segmentation
Vid2Avatar-Pro: Authentic Avatar from Videos in the Wild via Universal Prior
VideoRepainter: Creative Video Inpainting with Keyframe Reference
JamMa: Ultra-lightweight Local Feature Matching with Joint Mamba
Cross-Modal and Uncertainty-Aware Agglomeration for Open-Vocabulary 3D Scene Understanding
Adaptive Part Learning for Fine-Grained Generalized Category Discovery: A Plug-and-Play Enhancement
Take the Bull by the Horns: Learning to Segment Hard Samples
APT: Adaptive Personalized Training for Diffusion Models with Limited Data
Improving Visual and Downstream Performance of Low-Light Enhancer with Vision Foundation Models Collaboration
STiL: Semi-supervised Tabular-Image Learning for Comprehensive Task-Relevant Information Exploration in Multimodal Classification
Unbiased Video Scene Graph Generation via Visual and Semantic Dual Debiasing
Distilling Monocular Foundation Model for Fine-grained Depth Completion
Understanding multi-layered transmission matrices
Pattern Analogies: Learning to Perform Programmatic Image Edits by Analogy
HeatFormer: A Neural Optimizer for Multiview Human Mesh Recovery
World-consistent Video Diffusion with Explicit 3D Modeling
PERSE: Personalized 3D Generative Avatars from A Single Portrait
WF-VAE: Enhancing Video VAE by Wavelet-Driven Energy Flow for Latent Video Diffusion Model
Training-free Neural Architecture Search through Variance of Knowledge of Deep Network Weights
GaPT-DAR: Category-level Garments Pose Tracking via Integrated 2D Deformation and 3D Reconstruction
EgoTextVQA: Towards Egocentric Scene-Text Aware Video Question Answering
Panorama Generation From NFoV Image Done Right
Progressive Rendering Distillation: Adapting Stable Diffusion for Instant Text-to-Mesh Generation without 3D Data
ParaHome: Parameterizing Everyday Home Activities Towards 3D Generative Modeling of Human-Object Interactions
SAR3D: Autoregressive 3D Object Generation and Understanding via Multi-scale 3D VQVAE
Scene4U: Hierarchical Layered 3D Scene Reconstruction from Single Panoramic Image for Your Immerse Exploration
TinyFusion: Diffusion Transformers Learned Shallow
Forming Auxiliary High-confident Instance-level Loss to Promote Learning from Label Proportions
NGV: Neural Gaussian Velocity for 3D Physics Modeling from Dynamic Videos
Thin-Shell-SfT: Fine-Grained Monocular Non-rigid 3D Surface Tracking with Neural Deformation Fields
Unbiased General Annotated Dataset Generation
A4A: Adapter for Adapter Transfer via All-for-All Mapping for Cross-Architecture Models
MANTA: Diffusion Mamba for Efficient and Effective Stochastic Long-Term Dense Anticipation
LITA-GS: Illumination-Agnostic Novel View Synthesis via Reference-Free 3D Gaussian Splatting and Physical Priors
LLM-driven Multimodal and Multi-Identity Listening Head Generation
OmniGuard: Hybrid Manipulation Localization via Augmented Versatile Deep Image Watermarking
MODfinity: Unsupervised Domain Adaptation with Multimodal Information Flow Intertwining
Temporal Action Detection Model Compression by Progressive Block Drop
Test-Time Domain Generalization via Universe Learning: A Multi-Graph Matching Approach for Medical Image Segmentation
DPC: Dual-Prompt Collaboration for Tuning Vision-Language Models
D$^2$iT: Dynamic Diffusion Transformer for Accurate Image Generation
AutoSSVH: Exploring Automated Frame Sampling for Efficient Self-Supervised Video Hashing
FlexiDiT: Your Diffusion Transformer Can Easily Generate High-Quality Samples with Less Compute
Boosting Domain Incremental Learning: Selecting the Optimal Parameters is All You Need
VTON 360: High-Fidelity Virtual Try-On from Any Viewing Direction
SPMTrack: Spatio-Temporal Parameter-Efficient Fine-Tuning with Mixture of Experts for Scalable Visual Tracking
AdaptCMVC: Robust Adaption to Incremental Views in Continual Multi-view Clustering
Hybrid Concept Bottleneck Models
GROVE: A Generalized Reward for Learning Open-Vocabulary Physical Skill
Z-Magic: Zero-shot Multiple Attributes Guided Image Creator
QuCOOP: A Versatile Framework for Solving Composite and Binary-Parametrised Problems on Quantum Annealers
Erasing Undesirable Influence in Diffusion Models
Pow3R: Empowering Unconstrained 3D Reconstruction with Camera and Scene Priors
Unlocking Generalization Power in LiDAR Point Cloud Registration
Adversarial Domain Prompt Tuning and Generation for Single Domain Generalization
Instant3dit: Multiview Inpainting for Fast Editing of 3D Objects
Rectification-specific Supervision and Constrained Estimator for Online Stereo Rectification
LaVin-DiT: Large Vision Diffusion Transformer
DPU: Dynamic Prototype Updating for Multimodal Out-of-Distribution Detection
Mimir: Improving Video Diffusion Models for Precise Text Understanding
Exploring Contextual Attribute Density in Referring Expression Counting
Functionality understanding and segmentation in 3D scenes
3DTopia-XL: Scaling High-quality 3D Asset Generation via Primitive Diffusion
Effective Cloud Removal for Remote Sensing Images by an Improved Mean-Reverting Denoising Model with Elucidated Design Space
GraphI2P: Image-to-Point Cloud Registration with Exploring Pattern of Correspondence via Graph Learning
Hierarchical Flow Diffusion for Efficient Frame Interpolation
ISMimic: Learning Basketball Interaction Skills from Demonstrations
The Art of Deception: Color Visual Illusions and Diffusion Models
Design2GarmentCode: Turning Design Concepts to Tangible Garments Through Program Synthesis
Open-Canopy: Towards Very High Resolution Forest Monitoring
Towards Universal Soccer Video Understanding
SeqMvRL: A Sequential Fusion Framework for Multi-view Representation Learning
EasyCraft: A Robust and Efficient Framework for Automatic Avatar Crafting
Perceptive 3D language assistant
FASTer: Focal token Acquiring-and-Scaling Transformer for Long-term 3D Object Detection
SplatAD: Real-Time Lidar and Camera Rendering with 3D Gaussian Splatting for Autonomous Driving
From Sparse Signal to Smooth Motion: Real-Time Motion Generation with Rolling Prediction Models
FisherTune: Fisher-Guided Robust Tuning of Vision Foundation Models for Domain Generalized Segmentation
MAGiC-SLAM: Multi-Agent Gaussian Globally Consistent SLAM
Degradation-Aware Feature Perturbation for All-in-One Image Restoration
Birth and Death of a Rose
StyleMaster: Stylize Your Video with Artistic Generation and Translation
SkySense-O: Towards Open-World Remote Sensing Interpretation with Vision-Centric Visual-Language Modeling
One-way ticket: Time-Independent Unified Encoder for Distilling Text-to-Image Diffusion Models
MMRL: Multi-Modal Representation Learning for Vision-Language Models
Ferret: An Efficient Online Continual Learning Framework under Varying Memory Constraints
Video Depth without Video Models
RoomTour3D: Geometry-Aware Video-Instruction Tuning for Embodied Navigation
OPTICAL: Leveraging Optimal Transport for Contribution Allocation in Dataset Distillation
Empowering Large Language Models with 3D Situation Awareness
Generative Gaussian Splatting for Unbounded 3D City Generation
Multi-view Reconstruction via SfM-guided Monocular Depth Estimation
UHD-processer: Unified UHD Image Restoration with Progressive Frequency Learning and Degradation-aware Prompts
GEAL: Generalizable 3D Affordance Learning with Cross-Modal Consistency
DeRS: Towards Extremely Efficient Upcycled Mixture-of-Experts Models
CCIN: Compositional Conflict Identification and Neutralization for Composed Image Retrieval
Prompting Depth Anything for 4K Resolution Accurate Metric Depth Estimation
Dinomaly: The Less Is More Philosophy in Multi-Class Unsupervised Anomaly Detection
Toward Generalized Image Quality Assessment: Relaxing the Perfect Reference Quality Assumption
Comprehensive Relighting: Generalizable and Consistent Monocular Human Relighting and Harmonization
POPEN: Preference-Based Optimization and Ensemble for LVLM-Based Reasoning Segmentation
Correlative and Discriminative Label Grouping for Multi-Label VPT
DFormerv2: Geometry Self-Attention for RGBD Semantic Segmentation
Visioner: Exploring Knowledge Learning from Raw Videos
A Stitch in Time Saves Nine: Small VLM is a Precise Guidance for accelerating Large VLMs
Provoking Multi-modal Few-Shot LVLM via Exploration-Exploitation In-Context Learning
AudCast: Audio-Driven Human Video Generation by Cascaded Diffusion Transformers
MVPaint: 3D Texture Generation with Multi-View Consistency
TacoDepth: Towards Efficient Radar-Camera Depth Estimation with One-stage Fusion
Stacking Brick by Brick: Aligned Feature Isolation for Incremental Face Forgery Detection
Blind Bitstream-corrupted Video Recovery via Metadata-guided Diffusion Model
Visual Prompting for One-shot Controllable Video Editing without Inversion
ScaleLSD: Scalable Deep Line Segment Detection Streamlined
ARM: Appearance Reconstruction Model for Relightable 3D Generation
Omni-RGPT: Unifying Image and Video Region-level Understanding via Token Marks
EgoLife: Towards Egocentric Life Assistant
Distribution Prototype Diffusion Learning for Open-set Supervised Anomaly Detection
M3GYM: A Large-Scale Multimodal Multi-view Multi-person Pose Dataset for Fitness Activity Understanding in Real-world Settings
CORE4D: A 4D Human-Object-Human Interaction Dataset for Collaborative Object REarrangement
PlanarSplatting: Accurate Planar Surface Reconstruction in 3 Minutes
F-LMM: Grounding Frozen Large Multimodal Models
LiMoE: Mixture of LiDAR Representation Learners from Automotive Scenes
MotiF: Making Text Count in Image Animation with Motion Focal Loss
VLog: Video-Language Models by Generative Retrieval of Narration Vocabulary
ShowUI: One Vision-Language-Action Model for GUI Visual Agent
Implicit Correspondence Learning for Image-to-Point Cloud Registration
EventFly: Event Camera Perception from Ground to the Sky
Luminance-GS: Adapting 3D Gaussian Splatting to Challenging Lighting Conditions with View-Adaptive Curve Adjustment
Shape My Moves: Text-Driven Shape-Aware Synthesis of Human Motions
Seeing is Not Believing: Adversarial Natural Object Optimization for Hard-Label 3D Scene Attacks
DriveGEN: Generalized and Robust 3D Detection in Driving via Controllable Text-to-Image Diffusion Generation
AniMo: Species-aware Model for Text-driven Animal Motion Generation
Learning Visual Generative Priors without Text
MeGA: Hybrid Mesh-Gaussian Head Avatar for High-Fidelity Rendering and Head Editing
Dynamic Integration of Task-Specific Adapters for Class Incremental Learning
ProxyTransformation: Preshaping Point Cloud Manifold With Proxy Attention For 3D Visual Grounding
OSMamba: Omnidirectional Spectral Mamba with Dual-Domain Prior Generator for Exposure Correction
Less Attention is More: Prompt Transformer for Generalized Category Discovery
Harnessing Frequency Spectrum Insights for Image Copyright Protection Against Diffusion Models
Adversarial Diffusion Compression for Real-World Image Super-Resolution
Towards Universal Dataset Distillation via Task-Driven Diffusion
Automated Proof of Polynomial Inequalities via Reinforcement Learning
DAGSM: Disentangled Avatar Generation with GS-enhanced Mesh
Bridge the Gap: From Weak to Full Supervision for Temporal Action Localization with PseudoFormer
PromptHMR: Promptable Human Mesh Recovery
Unleashing the Potential of Consistency Learning for Detecting and Grounding Multi-Modal Media Manipulation
HyperNet Fields: Efficiently Training Hypernetworks without Ground Truth by Learning Weight Trajectories
GazeGene: Large-scale Synthetic Gaze Dataset with 3D Eyeball Annotations
UniPre3D: Unified Pre-training of 3D Point Cloud Models with Cross-Modal Gaussian Splatting
DoF-Gaussian: Controllable Depth-of-Field for 3D Gaussian Splatting
Self-Improving Large Multimodal Models for Compositional Text-to-Image Generation
WildAvatar: Learning In-the-wild 3D Avatars from the Web
CH$_3$Depth: Efficient and Flexible Depth Foundation Model with Flow Matching
Towards RAW Object Detection in Diverse Conditions
Context-Enhanced Memory-Refined Transformer for Online Action Detection
Compass Control: Multi-Object Orientation Control for Text-to-Image Generation
SegEarth-OV: Towards Training-Free Open-Vocabulary Segmentation for Remote Sensing Images
Seeing What Matters: Empowering CLIP with Patch Generation-to-Selection
Masking meets Supervision: A Strong Learning Alliance
Improving the Transferability of Adversarial Attacks on Face Recognition with Diverse Parameters Augmentation
ORIDa: Object-centric Real-world Image Composition Dataset
Latent space Super-Resolution for Higher-Resolution Image Generation with Diffusion Models
On the Consistency of Video Large Language Models in Temporal Comprehension
Apollo: An Exploration of Video Understanding in Large Multi-Modal Models
Distilling Multi-modal Large Language Models for Autonomous Driving
Diffusion Self-Distillation for Zero-Shot Customized Image Generation
X-Dyna: Expressive Dynamic Human Image Animation
InteractVLM: 3D Interaction Reasoning from 2D Foundational Models
Prompt-CAM: Prompt-Class Attention Map for Fine-grained Interpretation
Transfer Your Perspective: Controllable 3D Generation from Any Viewpoint in a Driving Scene
4Deform: Neural Surface Deformation for Robust Shape Interpolation
Heterogeneous Teacher Distillation
EgoPressure: A Dataset for Hand Pressure and Pose Estimation in Egocentric Vision
Recreating 1940s Tom and Jerry with Test-Time Training
Odd-One-Out: Anomaly Detection by Comparing with Neighbors
MAtCha Gaussians: Atlas of Charts for High-Quality Geometry and Photorealism From Sparse Views
AnySat: An Earth Observation Model for Any Modalities, Resolutions, and Scales
Human-centered Interactive Learning via MLLMs for Text-to-Image Person Re-identification
SF3D: Stable Fast 3D Mesh Reconstruction with UV-unwrapping and Illumination Disentanglement
The Power of Context: How Multimodality Improves Image Super-Resolution
MARBLE: Material Recomposition and Blending in CLIP-Space
SPAR3D: Stable Point-Aware Reconstruction of 3D Objects from Single Images
Chapter-Llama: Efficient Chaptering in Hour-Long Videos with LLMs
Towards Autonomous Micromobility through Scalable Urban Simulation
Around the World in 80 Timesteps: A Generative Approach to Global Visual Geolocation
Adaptive Non-Uniform Timestep Sampling for Diffusion Model Training
Sound Bridge: Associating Egocentric and Exocentric Videos via Audio Cues
Detect-and-Guide: Self-regulation of Diffusion Models for Safe Text-to-Image Generation via Guideline Token Optimization
SimLTD: Simple Semi-Supervised Long-Tailed Object Detection
Do computer vision foundation models learn the low-level characteristics of the human visual system?
Dyn-HaMR: Recovering 4D Interacting Hand Motion from a Dynamic Camera
Reasoning to Attend: Try to Understand How [SEG] Token Works
Diffusion-4K: Ultra-High-Resolution Image Synthesis with Latent Diffusion Models
Do We Really Need Curated Malicious Data for Safety Alignment in Multi-modal Large Language Models?
VasTSD: Learning 3D Vascular Tree-state Space Diffusion Model for Angiography Synthesis
UniNet: A Contrastive Learning-guided Unified Framework with Feature Selection for Anomaly Detection
Conformal Prediction for Zero-Shot Models
Minority-Focused Text-to-Image Generation via Prompt Optimization
BADGR: Bundle Adjustment Diffusion Conditioned by Gradients for Wide-Baseline Floor Plan Reconstruction
Discovering Hidden Visual Concepts Beyond Linguistic Input in Infant Learning
Trajectory-Mamba: An Efficient Attention-Mamba Forecasting Model Based on Selective SSM
MotionBench: Benchmarking and Improving Fine-grained Video Motion Understanding for Vision Language Models
Twinner: Shining Light on Digital Twins in a Few Snaps
Efficient Long Video Tokenization via Coordinate-based Patch Reconstruction
Diffusion-based Event Generation for High-Quality Image Deblurring
SACB-Net: Spatial-awareness Convolutions for Medical Image Registration
VoxelSplat: Dynamic Gaussian Splatting as an Effective Loss for Occupancy and Flow Prediction
Enhancing Few-Shot Class-Incremental Learning via Training-Free Bi-Level Modality Calibration
CleanDIFT: Diffusion Features without Noise
PS-EIP: Robust Photometric Stereo Based on Event Interval Profile
MI-DETR: An Object Detection Model with Multi-time Inquiries Mechanism
Completion as Enhancement: A Degradation-Aware Selective Image Guided Network for Depth Completion
Fine-Grained Erasure in Text-to-Image Diffusion-based Foundation Models
Any-Resolution AI-Generated Image Detection by Spectral Learning
Learned Image Compression with Dictionary-based Entropy Model
DriveScape: High-Resolution Driving Video Generation by Multi-View Feature Fusion
Interleaved-modal Chain-of-Thought
ICE: Intrinsic Concept Extraction From a Single Image via Diffusion Models
Tokenize Image Patches: Global Context Fusion for Effective Haze Removal in Large Images
Accurate Differential Operators for Hybrid Neural Fields
3D-HGS: 3D Half-Gaussian Splatting
Video Summarization with Large Language Models
RoomPainter: View-Integrated Diffusion for Consistent Indoor Scene Texturing
PLeaS - Merging Models with Permutations and Least Squares
Mono2Stereo: A Benchmark and Empirical Study for Stereo Conversion
GaussianSpa: An “Optimizing-Sparsifying” Simplification Framework for Compact and High-Quality 3D Gaussian Splatting
MaskGaussian: Adaptive 3D Gaussian Representation from Probabilistic Masks
Words or Vision: Do Vision-Language Models Have Blind Faith in Text?
Generative Map Priors for Collaborative BEV Semantic Segmentation
HandOS: 3D Hand Reconstruction in One Stage
One-Step Event-Driven High-Speed Autofocus
Classic Video Denoising in a Machine Learning World: Robust, Fast, and Controllable
From Head to Tail: Towards Balanced Representation in Large Vision-Language Models through Adaptive Data Calibration
UNOPose: Unseen Object Pose Estimation with an Unposed RGB-D Reference Image
VGGN: Visual Geometry Grounded Network
Distraction is All You Need for Multimodal Large Language Model Jailbreaking
GG-SSMs: Graph-Generating State Space Models
Task-Specific Gradient Adaptation for Few-Shot One-Class Classification
4DTAM: Non-Rigid Tracking and Mapping via Dynamic Surface Gaussians
DIV-FF: Dynamic Image-Video Feature Fields For Environment Understanding in Egocentric Video
Multi-party Collaborative Attention Control for Image Customization
PUP 3D-GS: Principled Uncertainty Pruning for 3D Gaussian Splatting
Docopilot: Improving Multimodal Models for Document-Level Understanding
How to Merge Your Multimodal Models Over Time?
Localizing Events in Videos with Multimodal Queries
BiLoRA: Almost-Orthogonal Parameter Spaces for Continual Learning
VLMs-Guided Representation Distillation for Efficient Vision-Based Reinforcement Learning
DepthSplat: Connecting Gaussian Splatting and Depth
Minimal Interaction Separated Tuning: A New Paradigm for Visual Adaptation
Context-Aware Multimodal Pretraining
Generative Image Layer Decomposition with Visual Effects
Advancing Adversarial Robustness in GNeRFs: The IL2-NeRF Attack
Guiding Human-Object Interactions with Rich Geometry and Relations
SCFlow2: Plug-and-Play Object Pose Refiner with Shape-Constraint Scene Flow
Fractal Calibration for long-tailed object detection
Is Your World Simulator a Good Story Presenter? A Consecutive Events-Based Benchmark for Future Long Video Generation
PassionSR: Post-Training Quantization with Adaptive Scale in One-Step Diffusion based Image Super-Resolution
SIDA: Social Media Image Deepfake Detection, Localization and Explanation with Large Multimodal Model
AffordDP: Generalizable Diffusion Policy with Transferable Affordance
Towards Generalizable Trajectory Prediction using Dual-Level Representation Learning and Adaptive Prompting
Alignment, Mining and Fusion: Representation Alignment with Hard Negative Mining and Selective Knowledge Fusion for Medical Visual Question Answering
Hierarchical Knowledge Prompt Tuning for Multi-task Test-Time Adaptation
From Multimodal LLMs to Generalist Embodied Agents: Methods and Lessons
MaSS13K: A Matting-level Semantic Segmentation Benchmark
The Photographer's Eye: Teaching Multimodal Large Language Models to See, Think and Critique Like Photographers
LaTexBlend: Scaling Multi-concept Customized Generation with Latent Textual Blending
Show and Tell: Visually Explainable Deep Neural Nets via Spatially-Aware Concept Bottleneck Models
CaMuViD: Calibration-Free Multi-View Detection
Reversible Decoupling Network for Single Image Reflection Removal
Reconciling Stochastic and Deterministic Strategies for Zero-shot Image Restoration using Diffusion Model in Dual
PatchVSR: Breaking Video Diffusion Resolution Limits with Patch-wise Video Super-Resolution
DynPose: Largely Improving the Efficiency of Human Pose Estimation by a Simple Dynamic Framework
InterDyn: Controllable Interactive Dynamics with Video Diffusion Models
FINECAPTION: Compositional Image Captioning Focusing on Wherever You Want at Any Granularity
Auto Cherry-Picker: Learning from High-quality Generative Data Driven by Language
D2SP: Dynamic Dual-Stage Purification Framework for Dual Noise Mitigation in Vision-based Affective Recognition
BlobGEN-Vid: Compositional Text-to-Video Generation with Blob Video Representations
How Do I Do That? Synthesizing 3D Hand Motion and Contacts for Everyday Interactions
TIMotion: Temporal and Interactive Framework for Efficient Human-Human Motion Generation
Synthetic Visual Genome
FALCON: Fairness Learning via Contrastive Attention Approach to Continual Semantic Scene Understanding
PatchGuard: Adversarially Robust Anomaly Detection and Localization through Vision Transformers and Pseudo Anomalies
MOS-Attack: A Scalable Multi-objective Adversarial Attack Framework
MaIR: A Locality- and Continuity-Preserving Mamba for Image Restoration
BooW-VTON: Boosting In-the-Wild Virtual Try-On via Mask-Free Pseudo Data Training
Hearing Anywhere in Any Environment
Sonic: Shifting Focus to Global Audio Perception in Portrait Animation
Document Haystacks: Vision-Language Reasoning Over Piles of 1000+ Documents
Graph-Embedded Structure-Aware Perceptual Hashing for Neural Network Protection and Piracy Detection
RANGE: Retrieval Augmented Neural Fields for Multi-Resolution Geo-Embeddings
Viewpoint Rosetta Stone: Unlocking Unpaired Ego-Exo Videos for View-invariant Representation Learning
Joint Out-of-Distribution Filtering and Data Discovery Active Learning
Leveraging Perturbation Robustness to Enhance Out-of-Distribution Detection
Gazing Into Missteps: Leveraging Eye-Gaze for Unsupervised Mistake Detection in Egocentric Videos of Skilled Human Activities
RAD: Region-Aware Diffusion Models for Image Inpainting
GREAT: Geometry-Intention Collaborative Inference for Open-Vocabulary 3D Object Affordance Grounding
Point-to-Region Loss for Semi-Supervised Point-Based Crowd Counting
VILA-M3: Enhancing Vision-Language Models with Medical Expert Knowledge
On the Zero-shot Adversarial Robustness of Vision-Language Models: A Truly Zero-shot and Training-free Approach
FIFA: Fine-grained Inter-frame Attention for Driver's Video Gaze Estimation
Foveated Instance Segmentation
Co-Speech Gesture Video Generation with Implicit Motion-Audio Entanglement
Learned Binocular-Encoding Optics for RGBD Imaging Using Joint Stereo and Focus Cues
BlenderGym: Benchmarking Foundational Model Systems for Graphics Editing
DiN: Diffusion Model for Robust Medical VQA with Semantic Noisy Labels
DKC: Differentiated Knowledge Consolidation for Cloth-Hybrid Lifelong Person Re-identification
GOAL: Global-local Object Alignment Learning
Towards Open-Vocabulary Audio-Visual Event Localization
DiffCAM: Data-Driven Saliency Maps by Capturing Feature Differences
Text-guided Sparse Voxel Pruning for Efficient 3D Visual Grounding
BFANet: Revisiting 3D Semantic Segmentation with Boundary Feature Analysis
NLPrompt: Noise-Label Prompt Learning for Vision-Language Models
Curriculum Direct Preference Optimization for Diffusion and Consistency Models
TexGaussian: Generating High-quality PBR Material via Octree-based 3D Gaussian Splatting
PHGC: Procedural Heterogeneous Graph Completion for Natural Language Task Verification in Egocentric Videos
Efficient Diffusion as Low Light Enhancer
Creating Your Editable 3D Photorealistic Avatar with Tetrahedron-constrained Gaussian Splatting
Navigating the Unseen: Zero-shot Scene Graph Generation via Capsule-Based Equivariant Features
RayFlow: Instance-Aware Diffusion Acceleration via Adaptive Flow Trajectories
PO3AD: Predicting Point Offsets toward Better 3D Point Cloud Anomaly Detection
Sea-ing in Low-light
Neural Hierarchical Decomposition for Single Image Plant Modeling
Toward Real-world BEV Perception: Depth Uncertainty Estimation via Gaussian Splatting
EDM: Equirectangular Projection-Oriented Dense Kernelized Feature Matching
SeqAfford: Sequential 3D Affordance Reasoning via Multimodal Large Language Model
Audio-Visual Instance Segmentation
Flexible Selection for Efficient Video Reasoning
DKDM: Data-Free Knowledge Distillation for Diffusion Models with Any Architecture
STEP: Enhancing Video-LLMs’ Compositional Reasoning by Spatio-Temporal Graph-guided Self-Training
NN-Former: Rethinking Graph Structure in Neural Architecture Representation
SynTab-LLaVA: Enhancing Multimodal Table Understanding with Decoupled Synthesis
ControlFace: Harnessing Facial Parametric Control for Face Rigging
ASHiTA: Automatic Scene-grounded HIerarchical Task Analysis
Understanding Multi-Task Activities from Single-Task Videos
Advancing Semantic Future Prediction through Multimodal Visual Sequence Transformers
A Theory of Learning Unified Model via Knowledge Integration from Label Space Varying Domains
SOLVE: Synergy of Language-Vision and End-to-End Networks for Autonomous Driving
Enhancing Diversity for Data-free Quantization
Few-shot Implicit Function Generation via Equivariance
See Further When Clear: Curriculum Consistency Model
Crafting a Miniature Interactive World from a Single Image
PoseBH: Prototypical Multi-Dataset Training Beyond Human Pose Estimation
BIMBA: Selective-Scan Compression for Long-Range Video Question Answering
GLASS: Guided Latent Slot Diffusion for Object-Centric Learning
Chain of Attack: On the Robustness of Vision-Language Models Against Transfer-Based Adversarial Attacks
Learning-enabled Polynomial Lyapunov Function Synthesis via High-Accuracy Counterexample-Guided Framework
Learning Bijective Surface Parameterization for Inferring Signed Distance Functions from Sparse Point Clouds with Grid Deformation
Mosaic of Modalities: A Comprehensive Benchmark for Multimodal Graph Learning
Meta-Learning Hyperparameters for Foundation Model Adaptation in Remote-Sensing Imagery
ArtiFade: Learning to Generate High-quality Subject from Blemished Image
A Simple yet Effective Layout Token in Large Language Models for Document Understanding
Vision-Language Model IP Protection via Prompt-based Learning
Efficient Motion-Aware Video MLLM
Gaussian Splashing: Unified Particles for Versatile Motion Synthesis and Rendering
Bridging the Vision-Brain Gap with an Uncertainty-Aware Blur Prior
MambaVO: Deep Visual Odometry by Sequential Matching Refinement and Training Smoothing
Unified Uncertainty-Aware Diffusion for Multi-Agent Trajectory Modeling
Enhancing Virtual Try-On with Synthetic Pairs and Error-Aware Noise Scheduling
MotionStone: Decoupled Motion Intensity Modulation with Diffusion Transformer for Image-to-Video Generation
CASP: Compression of Large Multimodal Models Based on Attention Sparsity
PMNI: Pose-free Multi-view Normal Integration for Reflective and Textureless Surface Reconstruction
PURA: Parameter Update-Recovery Test-Time Adaption for RGB-T Tracking
Instant Gaussian Stream: Fast and Generalizable Streaming of Dynamic Scene Reconstruction via Gaussian Splatting
Towards Effective and Sparse Adversarial Attack on Spiking Neural Networks via Breaking Invisible Surrogate Gradients
InPO: Inversion Preference Optimization with Reparametrized DDIM for Efficient Diffusion Model Alignment
Retrieval-Augmented Personalization for Multimodal Large Language Models
Aesthetic Post-Training Diffusion Models from Generic Preferences with Step-by-step Preference Optimization
GO-N3RDet: Geometry Optimized NeRF-enhanced 3D Object Detector
Structure-from-Motion with a Non-Parametric Camera Model
CAT4D: Create Anything in 4D with Multi-View Video Diffusion Models
VoteFlow: Enforcing Local Rigidity in Self-Supervised Scene Flow
Unified Dense Prediction of Video Diffusion
ONDA-Pose: Occlusion-Aware Neural Domain Adaptation for Self-Supervised 6D Object Pose Estimation
Learning Textual Prompts for Open-World Semi-Supervised Learning
Towards a Universal Synthetic Video Detector: From Face or Background Manipulations to Fully AI-Generated Content
Reducing Class-wise Confusion for Incremental Learning with Disentangled Manifolds
VideoGLaMM: A Large Multimodal Model for Pixel-Level Visual Grounding in Videos
Dual Prompting for Image Restoration across Full-Scene with Diffusion Transformers
PolarNeXt: Rethink Instance Segmentation with Polar Representation
RipVIS: Rip Currents Video Instance Segmentation Benchmark for Beach Monitoring and Safety
CustomKD: Customizing Large Vision Foundation for Edge Model Improvement via Knowledge Distillation
Stretching Each Dollar: Diffusion Training from Scratch on a Micro-Budget
Doppelgängers and Adversarial Vulnerability
iG-6DoF: Model-free 6DoF Pose Estimation for Unseen Object via Iterative 3D Gaussian Splatting
Efficient Video Super-Resolution for Real-time Rendering with Decoupled G-buffer Guidance
Human-Aligned Video Generation Benchmark
Benchmarking Object Detectors under Real-World Distribution Shifts in Satellite Imagery
Evaluating Vision-Language Models as Evaluators in Path Planning
Galaxy Walker: Geometry-aware VLMs For Galaxy-scale Understanding
Gyro-based Neural Single Image Deblurring
CLOC: Contrastive Learning for Ordinal Classification with Multi-Margin N-pair Loss
Task Singular Vectors: Reducing Task Interference in Model Merging
ReVisionLLM: Recursive Vision-Language Model for Temporal Grounding in Hour-Long Videos
ProbeSDF: Light Field Probes For Neural Surface Reconstruction
RobSense: A Robust Multi-modal Foundation Model for Remote Sensing with Static, Temporal, and Incomplete Data Adaptability
Prompt2Perturb (P2P): Text-Guided Diffusion-Based Adversarial Attack on Breast Ultrasound Images
CAD-Llama: Leveraging Large Language Models for Computer-Aided Design Parametric 3D Model Generation
DOF-GS: Adjustable Depth-of-Field 3D Gaussian Splatting for Post-Capture Refocusing, Defocus Rendering and Blur Removal
Keep the Balance: A Parameter-Efficient Symmetrical Framework for RGB+X Semantic Segmentation
VITED: Video Temporal Evidence Distillation
LotusFilter: Fast Diverse Nearest Neighbor Search via a Learned Cutoff Table
Ref-GS: Directional Factorization for 2D Gaussian Splatting
Phoenix: A Motion-based Self-Reflection Framework for Fine-grained Robotic Action Correction
MVDoppler-Pose: Multi-Modal Multi-View mmWave Sensing for Long-Distance Self-Occluded Human Walking Pose Estimation
FSHNet: Fully Sparse Hybrid Network for 3D Object Detection
ERUPT: Efficient Rendering with Unposed Patch Transformer
DejaVid: Encoder-Agnostic Learned Temporal Matching for Video Classification
Unveiling Visual Perception in Language Models: An Attention Head Analysis Approach
Think Small, Act Big: Primitive Prompt Learning for Lifelong Robot Manipulation
CoSpace: Benchmarking Continuous Space Perception Ability for Vision-Language Models
Efficient Personalization of Quantized Diffusion Model without Backpropagation
RICCARDO: Radar Hits Prediction and Convolution for Target Detection with Radar-Camera Fusion
Object Detection using Event Camera: A MoE Heat Conduction based Detector and A New Benchmark Dataset
ChainHOI: Joint-based Kinematic Chain Modeling for Human-Object Interaction Generation
Touch2Shape: Touch-Conditioned 3D Diffusion for Shape Exploration and Reconstruction
NVILA: Efficient Frontier Visual Language Models
Advancing Multiple Instance Learning with Continual Learning for Whole Slide Imaging
SEC-Prompt: SEmantic Complementary Prompting for Few-Shot Class-Incremental Learning
PanoGS: Gaussian-based Panoptic Segmentation for 3D Open Vocabulary Scene Understanding
Interpretable Generative Models through Post-hoc Concept Bottlenecks
Advancing Manga Analysis: Comprehensive Segmentation Annotations for the Manga109 Dataset
Distilling Spatially-Heterogeneous Distortion Perception for Blind Image Quality Assessment
Exploring the Deep Fusion of Large Language Models and Diffusion Transformers for Text-to-Image Synthesis
Light3R-SfM: Towards Feed-forward Structure-from-motion
Retaining Knowledge and Enhancing Long-Text Representations in CLIP through Dual-Teacher Distillation
Sketchy Bounding-box Supervision for 3D Instance Segmentation
PSBD: Prediction Shift Uncertainty Unlocks Backdoor Detection
Latent Space Imaging
DiffPortrait360: Consistent Portrait Diffusion for 360 View Synthesis
SLADE: Shielding against Dual Exploits in Large Vision-Language Models
LinGen: Towards High-Resolution Minute-Length Text-to-Video Generation with Linear Computational Complexity
Dynamic Neural Surfaces for Elastic 4D Shape Representation and Analysis
Dense Dispersed Structured Light for Hyperspectral 3D Imaging of Dynamic Scenes
Beyond Background Shift: Rethinking Instance Replay in Continual Semantic Segmentation
3D-GRAND: A Million-Scale Dataset for 3D-LLMs with Better Grounding and Less Hallucination
A Flag Decomposition for Hierarchical Datasets
Beyond Words: Augmenting Discriminative Richness via Diffusions in Unsupervised Prompt learning
Align-KD: Distilling Cross-Modal Alignment Knowledge for Mobile Vision-Language Model
MAC-Ego3D: Multi-Agent Gaussian Consensus for Real-Time Collaborative Ego-Motion and Photorealistic 3D Reconstruction
Polarized Color Screen Matting
Mask$^2$DiT: Dual Mask-based Diffusion Transformer for Multi-Scene Long Video Generation
Sensitivity-Aware Efficient Fine-Tuning via Compact Dynamic-Rank Adaptation
Shading Meets Motion: Self-supervised Indoor 3D Reconstruction Via Simultaneous Shape-from-Shading and Structure-from-Motion
Multi-modal Medical Diagnosis via Large-small Model Collaboration
AnyMap: Learning a General Camera Model for Structure-from-Motion with Unknown Distortion in Dynamic Scenes
FATE: Full-head Gaussian Avatar with Textural Editing from Monocular Video
LoRASculpt: Sculpting LoRA for Harmonizing General and Specialized Knowledge in Multimodal Large Language Models
TSD-SR: One-Step Diffusion with Target Score Distillation for Real-World Image Super-Resolution
Discovering Fine-Grained Visual-Concept Relations by Disentangled Optimal Transport Concept Bottleneck Models
DeCafNet: Delegate and Conquer for Efficient Temporal Grounding in Long Videos
Blurred LiDAR for Sharper 3D: Robust Handheld 3D Scanning with Diffuse LiDAR and RGB
Scaling up Image Segmentation across Data and Tasks
RivuletMLP: An MLP-based Architecture for Efficient Compressed Video Quality Enhancement
HistoFS: Non-IID Histopathologic Whole Slide Image Classification via Federated Style Transfer with RoI-Preserving
StyleSSP: Sampling StartPoint Enhancement for Training-free Diffusion-based Method for Style Transfer
From Head to Tail: Efficient Black-box Model Inversion Attack via Long-tailed Learning
3D Gaussian Head Avatars with Expressive Dynamic Appearances by Compact Tensorial Representations
TopNet: Transformer-Efficient Occupancy Prediction Network for Octree-Structured Point Cloud Geometry Compression
Physical Plausibility-aware Trajectory Prediction via Locomotion Embodiment
Multi-Sensor Object Anomaly Detection: Unifying Appearance, Geometry, and Internal Properties
Fast Convergence of Diffusion Transformers in a Better High-Dimensional Latent Space
Detecting Backdoor Attacks in Federated Learning via Direction Alignment Inspection
Advancing Fine-Grained Compositional Alignment in Video-Text Models
CARL: A Framework for Equivariant Image Registration
Detecting Out-of-Distribution through the Lens of Neural Collapse
Empowering Vector Graphics with Consistently Arbitrary Viewing and View-dependent Visibility
VidBot: Learning Generalizable 3D Actions from In-the-Wild 2D Human Videos for Zero-Shot Robotic Manipulation
CityWalker: Learning Embodied Urban Navigation from Web-Scale Videos
MotionPro: A Precise Motion Controller for Image-to-Video Generation
Focal Split: Untethered Snapshot Depth from Differential Defocus
LeviTor: 3D Trajectory Oriented Image-to-Video Synthesis
RELOCATE: A Simple Training-Free Baseline for Visual Query Localization Using Region-Based Representations
EgoLM: Multi-Modal Language Model of Egocentric Motions
Potts Relaxations and Soft Self-labeling for Weakly-supervised Segmentation
Parameterized Blur Kernel Prior Learning for Local Motion Deblurring
OpenING: A Comprehensive Benchmark for Judging Open-ended Interleaved Image-Text Generation
Split Adaptation for Pre-trained Vision Transformers
SSHNet: Unsupervised Cross-modal Homography Estimation via Problem Redefinition and Split Optimization
ADU: Adaptive Detection of Unknown Categories in Black-Box Domain Adaptation
PromptHash: Affinity-Prompted Collaborative Cross-Modal Learning for Adaptive Hashing Retrieval
DiffLocks: Reconstructing 3D Hair from a Single Image using Diffusion Models
LAL: Enhancing 3D Human Motion Prediction with Latency-aware Auxiliary Learning
K-LoRA: Unlocking Training-Free Fusion of Any Subject and Style LoRAs
Event-based Tracking of Any Point with Motion-Robust Correlation Features
Generalizable Object Keypoint Localization from Generative Priors
Data Distributional Properties As Inductive Bias for Systematic Generalization
Boltzmann Attention Sampling for Image Analysis with Small Objects
DIFFER: Disentangling Identity Features via Semantic Cues for Clothes-Changing Person Re-ID
Skip Tuning: Pre-trained Vision-Language Models are Effective and Efficient Adapters Themselves
Feature Spectrum Learning for Remote Sensing Change Detection
Steady Progress Beats Stagnation: Mutual Aid of Foundation and Conventional Models in Mixed Domain Semi-Supervised Medical Image Segmentation
Paint by Inpaint: Learning to Add Image Objects by Removing Them First
Active Event-based Stereo Vision
GBlobs: Explicit Local Structure via Gaussian Blobs for Improved Cross-Domain LiDAR-based 3D Object Detection
Generalized Recorrupted-to-Recorrupted: Self-Supervised Learning Beyond Gaussian Noise
POSTA: A Go-to Framework for Customized Artistic Poster Generation
Your ViT is Secretly an Image Segmentation Model
Automatic Spectral Calibration of Hyperspectral Images: Method, Dataset and Benchmark
Samba: A Unified Mamba-based Framework for General Salient Object Detection
ArticulatedGS: Self-supervised Digital Twin Modeling of Articulated Objects using 3D Gaussian Splatting
ManipTrans: Efficient Dexterous Bimanual Manipulation Transfer via Residual Learning
ProHOC: Probabilistic Hierarchical Out-of-Distribution Classification via Multi-Depth Networks
The Illusion of Unlearning: The Unstable Nature of Machine Unlearning in Text-to-Image Diffusion Models
Knowledge Memorization and Rumination for Pre-trained Model-based Class-Incremental Learning
From Zero to Detail: Deconstructing Ultra-High-Definition Image Restoration from Progressive Spectral Perspective
Optimus-2: Multimodal Minecraft Agent with Goal-Observation-Action Conditioned Policy
TraF-Align: Trajectory-aware Feature Alignment for Asynchronous Multi-agent Perception
Rate-In: Information-Driven Adaptive Dropout Rates for Improved Inference-Time Uncertainty Estimation
Rethink Visual-language Pretraining for Deepfake Detection: Multi-modal Interpretable Forged Face Detection
Training Data Provenance Verification: Did Your Model Use Synthetic Data from My Generative Model for Training?
GeoAvatar: Geometrically-Consistent Multi-Person Avatar Reconstruction from Sparse Multi-View Videos
BARD-GS: Blur-Aware Reconstruction of Dynamic Scenes via Gaussian Splatting
Dual Exposure Stereo for Extended Dynamic Range 3D Imaging
Channel-wise Noise Scheduled Diffusion for Inverse Rendering in Indoor Scenes
DecoupledGaussian: Object-Scene Decoupling for Physics-Based Interaction
MVGenMaster: Scaling Multi-View Generation from Any Image via 3D Priors Enhanced Diffusion Model
Uncertainty Meets Diversity: A Comprehensive Active Learning Framework for Indoor 3D Object Detection
HalLoc: Token-level Localization of Hallucinations for Vision Language Models
Augmenting Perceptual Super-Resolution via Image Quality Predictors
Personalized Preference Fine-tuning of Diffusion Models
Adaptive Keyframe Sampling for Long Video Understanding
Preserve or Modify? Context-Aware Evaluation for Balancing Preservation and Modification in Text-Guided Image Editing
Bayesian Prompt Flow Learning for Zero-Shot Anomaly Detection
Fortifying Federated Learning Towards Trustworthiness via Auditable Data Valuation and Verifiable Client Contribution
Hyperdimensional Uncertainty Quantification for Multimodal Uncertainty Fusion in Autonomous Vehicles Perception
Faster Parameter-Efficient Tuning with Token Redundancy Reduction
Number it: Temporal Grounding Videos like Flipping Manga
A Data-Centric Revisit of Pre-Trained Vision Models for Robot Learning
Hierarchical Compact Clustering Attention (COCA) for Unsupervised Object-Centric Learning
Quad-Pixel Image Defocus Deblurring: A New Benchmark and Model
Identifying and Mitigating Spurious Correlation in Multi-Task Learning
Language-Guided Salient Object Ranking
SCSegamba: Lightweight Structure-Aware Vision Mamba for Crack Segmentation in Structures
Subnet-Aware Dynamic Supernet Training for Neural Architecture Search
BlockDance: Reuse Structurally Similar Spatio-Temporal Features to Accelerate Diffusion Transformers
Implicit Bias Injection Attacks against Text-to-Image Diffusion Models
GenFusion: Closing the loop between Reconstruction and Generation via Videos
RoboTwin: Dual-Arm Robot Benchmark with Generative Digital Twins
Let Humanoid Robots Go Hiking! Integrative Skill Development over Complex Trails
AnimateAnything: Consistent and Controllable Animation for video generation
ECVC: Exploiting Non-Local Correlations in Multiple Frames for Contextual Video Compression
Generative Inbetweening through Frame-wise Conditions-Driven Video Generation
BASKET: A Large-Scale Video Dataset for Fine-Grained Skill Estimation
Deep Multi-View Multi-Label Learning with Incomplete Views and Noisy Labels
Identifying and Mitigating Position Bias of Multi-image Vision-Language Models
HORP: Human-Object Relation Priors Guided HOI Detection
Annotation Ambiguity Aware Semi-Supervised Medical Image Segmentation
TFCustom: Customized Image Generation with Time-Aware Frequency Feature Guidance
Video Depth Anything: Consistent Depth Estimation for Super-Long Videos
ADD: A General Attribution-Driven Data Augmentation Framework for Boosting Image Super-Resolution
Free on the Fly: Enhancing Flexibility in Test-Time Adaptation with Online EM
Autoregressive Sequential Pretraining for Visual Tracking
Dynamic Updates for Language Adaptation in Visual-Language Tracking
Let Samples Speak: Mitigating Spurious Correlation by Exploiting the Clusterness of Samples
Continuous 3D Perception Model with Persistent State
Satellite to GroundScape - Large-scale Consistent Ground View Generation from Satellite Views
Anomize: Better Open Vocabulary Video Anomaly Detection
TAROT: Towards Essentially Domain-Invariant Robustness with Theoretical Justification
On the Generalization of Handwritten Text Recognition Models
FFaceNeRF: Few-shot Face Editing in Neural Radiance Fields
ESC: Erasing Space Concept for Knowledge Deletion
Attribute-Missing Multi-view Graph Clustering
Stop Walking in Circles! Bailing Out Early in Projected Gradient Descent
ViKIENet: Towards Efficient 3D Object Detection with Virtual Key Instance Enhanced Network
Classifier-to-Bias: Toward Unsupervised Automatic Bias Detection for Visual Classifiers
Towards Satellite Image Road Graph Extraction: A Global-Scale Dataset and A Novel Method
A Distractor-Aware Memory for Visual Object Tracking with SAM2
DiffLO: Semantic-Aware LiDAR Odometry with Diffusion-based Refinement
PosterO: Structuring Layout Trees to Enable Language Models in Generalized Content-Aware Layout Generation
Multi-Modal Synergistic Implicit Image Enhancement for Efficient Optical Flow Estimation
Ground-V: Teaching VLMs to Ground Complex Instructions in Pixels
Mind the Trojan Horse: Image Prompt Adapter Enabling Scalable and Deceptive Jailbreaking
DualTalk: Dual-Speaker Interaction for 3D Talking Head Conversations
Rethinking Lanes and Points in Complex Scenarios for Monocular 3D Lane Detection
I2VGuard: Safeguarding Images against Misuse in Diffusion-based Image-to-Video Models
Towards Source-Free Machine Unlearning
GenVDM: Generating Vector Displacement Maps From a Single Image
Gaussian Eigen Models for Human Heads
MeshGen: Generating PBR Textured Mesh with Render-Enhanced Auto-Encoder and Generative Data Augmentation
Neural LightRig: Unlocking Accurate Object Normal and Material Estimation with Multi-Light Diffusion
4Real-Video: Learning Generalizable Photo-realistic 4D Video Diffusion
Cubify Anything: Scaling Indoor 3D Object Detection
NSD-Imagery: A benchmark dataset for extending fMRI vision decoding methods to mental imagery
SuperPC: A Single Diffusion Model for Point Cloud Completion, Upsampling, Denoising, and Colorization
CGMatch: A Different Perspective of Semi-supervised Learning
Perceptual Video Compression with Neural Wrapping
CountLLM: Towards Generalizable Repetitive Action Counting via Large Language Model
SKE-Layout: Spatial Knowledge Enhanced Layout Generation with LLMs
Identity-preserving Distillation Sampling by Fixed-Point Iterator
LogoSP: Local-global Grouping of Superpoints for Unsupervised Semantic Segmentation of 3D Point Clouds
Prior Does Matter: Visual Navigation via Denoising Diffusion Bridge Models
Scaling Down Text Encoders of Text-to-Image Diffusion Models
OmniDrive: A Holistic Vision-Language Dataset for Autonomous Driving with Counter Factual Reasoning
Maintaining Consistent Inter-Class Topology in Continual Test-Time Adaptation
3D Occupancy Prediction with Low-Resolution Queries via Prototype-aware View Transformation
3D-GSW: 3D Gaussian Splatting for Robust Watermarking
HyperNVD: Accelerating Neural Video Decomposition via Hypernetworks