CVPR 2025 Papers

AI-Face: A Million-Scale Demographically Annotated AI-Generated Face Dataset and Fairness Benchmark

S^3-Face: SSS-Compliant Facial Reflectance Estimation via Diffusion Priors

FSBench: A Figure Skating Benchmark for Advancing Artistic Sports Understanding

Keep the Balance: A Parameter-Efficient Symmetrical Framework for RGB+X Semantic Segmentation

VideoDirector: Precise Video Editing via Text-to-Video Models

LLM-driven Multimodal and Multi-Identity Listening Head Generation

Towards Understanding How Knowledge Evolves in Large Vision-Language Models

A Unified, Resilient, and Explainable Adversarial Patch Detector

VISTA: Enhancing Long-Duration and High-Resolution Video Understanding by Video Spatiotemporal Augmentation

Structured 3D Latents for Scalable and Versatile 3D Generation

GA3CE: Unconstrained 3D Gaze Estimation with Gaze-Aware 3D Context Encoding

CoMapGS: Covisibility Map-based Gaussian Splatting for Sparse Novel View Synthesis

Any-Resolution AI-Generated Image Detection by Spectral Learning

STOP: Integrated Spatial-Temporal Dynamic Prompting for Video Understanding

Improving the Training of Data-Efficient GANs via Quality Aware Dynamic Discriminator Rejection Sampling

Shading Meets Motion: Self-supervised Indoor 3D Reconstruction Via Simultaneous Shape-from-Shading and Structure-from-Motion

Believing is Seeing: Unobserved Object Detection using Generative Models

NLPrompt: Noise-Label Prompt Learning for Vision-Language Models

MEGA: Masked Generative Autoencoder for Human Mesh Recovery

PBR-NeRF: Inverse Rendering with Physics-Based Neural Fields

Disentangling Safe and Unsafe Image Corruptions via Anisotropy and Locality

Prometheus: 3D-Aware Latent Diffusion Models for Feed-Forward Text-to-3D Scene Generation

No Pains, More Gains: Recycling Sub-Salient Patches for Efficient High-Resolution Image Recognition

SphereUFormer: A U-Shaped Transformer for Spherical 360 Perception

Advancing Generalizable Tumor Segmentation with Anomaly-Aware Open-Vocabulary Attention Maps and Frozen Foundation Diffusion Models

Incomplete Multi-modal Brain Tumor Segmentation via Learnable Sorting State Space Model

FreeUV: Ground-Truth-Free Realistic Facial UV Texture Recovery via Cross-Assembly Inference Strategy

HarmonySet: A Comprehensive Dataset for Understanding Video-Music Semantic Alignment and Temporal Synchronization

Rethinking Diffusion for Text-Driven Human Motion Generation: Redundant Representations, Evaluation, and Masked Autoregression

OmniGuard: Hybrid Manipulation Localization via Augmented Versatile Deep Image Watermarking

ClearSight: Visual Signal Enhancement for Object Hallucination Mitigation in Multimodal Large Language Models

Stretching Each Dollar: Diffusion Training from Scratch on a Micro-Budget

Guiding Human-Object Interactions with Rich Geometry and Relations

TacoDepth: Towards Efficient Radar-Camera Depth Estimation with One-stage Fusion

Physical Plausibility-aware Trajectory Prediction via Locomotion Embodiment

CADDreamer: CAD Object Generation from Single-view Images

Vision-Language Model IP Protection via Prompt-based Learning

Where's the Liability in the Generative Era? Recovery-based Black-Box Detection of AI-Generated Content

DiTASK: Multi-Task Fine-Tuning with Diffeomorphic Transformations

OW-OVD: Unified Open World and Open Vocabulary Object Detection

Omni-Scene: Omni-Gaussian Representation for Ego-Centric Sparse-View Scene Reconstruction

Improving Diffusion Inverse Problem Solving with Decoupled Noise Annealing

AvatarArtist: Open-Domain 4D Avatarization

DesignDiffusion: High-Quality Text-to-Design Image Generation with Diffusion Models

Using Powerful Prior Knowledge of Diffusion Model in Deep Unfolding Networks for Image Compressive Sensing

Koala-36M: A Large-scale Video Dataset Improving Consistency between Fine-grained Conditions and Video Content

VASparse: Towards Efficient Visual Hallucination Mitigation via Visual-Aware Token Sparsification

SPARC: Score Prompting and Adaptive Fusion for Zero-Shot Multi-Label Recognition in Vision-Language Models

Noise-Consistent Siamese-Diffusion for Medical Image Synthesis and Segmentation

DefectFill: Realistic Defect Generation with Inpainting Diffusion Model for Visual Inspection

FedMIA: An Effective Membership Inference Attack Exploiting "All for One" Principle in Federated Learning

Prompt-CAM: Making Vision Transformers Interpretable for Fine-Grained Analysis

DPFlow: Adaptive Optical Flow Estimation with a Dual-Pyramid Framework

Ferret: An Efficient Online Continual Learning Framework under Varying Memory Constraints

Spatiotemporal Skip Guidance for Enhanced Video Diffusion Sampling

VidComposition: Can MLLMs Analyze Compositions in Compiled Videos?

Dual-Interrelated Diffusion Model for Few-Shot Anomaly Image Generation

ODE: Open-Set Evaluation of Hallucinations in Multimodal Large Language Models

Semantic and Sequential Alignment for Referring Video Object Segmentation

Self-Supervised Learning for Color Spike Camera Reconstruction

From Elements to Design: A Layered Approach for Automatic Graphic Design Composition

Masking meets Supervision: A Strong Learning Alliance

MarkushGrapher: Joint Visual and Textual Recognition of Markush Structures

DI-PCG: Diffusion-based Efficient Inverse Procedural Content Generation for High-quality 3D Asset Creation

Are Spatial-Temporal Graph Convolution Networks for Human Action Recognition Over-Parameterized?

DA-VPT: Semantic-Guided Visual Prompt Tuning for Vision Transformers

Towards Lossless Implicit Neural Representation via Bit Plane Decomposition

MambaIRv2: Attentive State Space Restoration

Spectral State Space Model for Rotation-Invariant Visual Representation Learning

BlueLM-V-3B: Algorithm and System Co-Design for Multimodal Large Language Models on Mobile Devices

CoMatcher: Multi-View Collaborative Feature Matching

Taming Teacher Forcing for Masked Autoregressive Video Generation

UniRestore: Unified Perceptual and Task-Oriented Image Restoration Model Using Diffusion Prior

SAR3D: Autoregressive 3D Object Generation and Understanding via Multi-scale 3D VQVAE

Sharp-It: A Multi-view to Multi-view Diffusion Model for 3D Synthesis and Manipulation

URWKV: Unified RWKV Model with Multi-state Perspective for Low-light Image Restoration

Revisiting Backdoor Attacks against Large Vision-Language Models from Domain Shift

Condensing Action Segmentation Datasets via Generative Network Inversion

Molmo and PixMo: Open Weights and Open Data for State-of-the-Art Vision-Language Models

DriveGPT4-V2: Harnessing Large Language Model Capabilities for Enhanced Closed-Loop Autonomous Driving

High-Fidelity Lightweight Mesh Reconstruction from Point Clouds

OSDFace: One-Step Diffusion Model for Face Restoration

Task Singular Vectors: Reducing Task Interference in Model Merging

Dragin3D: Image Editing by Dragging in 3D Space

T2V-CompBench: A Comprehensive Benchmark for Compositional Text-to-video Generation

Self-Evolving Visual Concept Library using Vision-Language Critics

TFCustom: Customized Image Generation with Time-Aware Frequency Feature Guidance

SketchFusion: Learning Universal Sketch Features through Fusing Foundation Models

Invisible Backdoor Attack against Self-supervised Learning

BWFormer: Building Wireframe Reconstruction from Airborne LiDAR Point Cloud with Transformer

Diffusion-4K: Ultra-High-Resolution Image Synthesis with Latent Diffusion Models

AffordDP: Generalizable Diffusion Policy with Transferable Affordance

HMAR: Efficient Hierarchical Masked Auto-Regressive Image Generation

OmniDrive: A Holistic Vision-Language Dataset for Autonomous Driving with Counterfactual Reasoning

ORIDa: Object-centric Real-world Image Composition Dataset

MeGA: Hybrid Mesh-Gaussian Head Avatar for High-Fidelity Rendering and Head Editing

Image Generation Diversity Issues and How to Tame Them

Annotation Ambiguity Aware Semi-Supervised Medical Image Segmentation

CAP4D: Creating Animatable 4D Portrait Avatars with Morphable Multi-View Diffusion Models

Comprehensive Information Bottleneck for Unveiling Universal Attribution to Interpret Vision Transformers

OpticalNet: An Optical Imaging Dataset and Benchmark Beyond the Diffraction Limit

Dataset Distillation with Neural Characteristic Function: A Minmax Perspective

Free-viewpoint Human Animation with Pose-correlated Reference Selection

CORE4D: A 4D Human-Object-Human Interaction Dataset for Collaborative Object REarrangement

PillarHist: A Quantization-aware Pillar Feature Encoder based on Height-aware Histogram

StyleStudio: Text-Driven Style Transfer with Selective Control of Style Elements

POp-GS: Next Best View in 3D-Gaussian Splatting with P-Optimality

Empowering Vector Graphics with Consistently Arbitrary Viewing and View-dependent Visibility

Semantic and Expressive Variations in Image Captions Across Languages

ATP-LLaVA: Adaptive Token Pruning for Large Vision Language Models

DIFFER: Disentangling Identity Features via Semantic Cues for Clothes-Changing Person Re-ID

Critic-V: VLM Critics Help Catch VLM Errors in Multimodal Reasoning

Towards Explicit Geometry-Reflectance Collaboration for Generalized LiDAR Segmentation in Adverse Weather

Mono3DVLT: Monocular-Video-Based 3D Visual Language Tracking

Towards Universal Dataset Distillation via Task-Driven Diffusion

Parametric Point Cloud Completion for Polygonal Surface Reconstruction

SyncSDE: A Probabilistic Framework for Diffusion Synchronization

SEAL: Semantic Attention Learning for Long Video Representation

MCCD: Multi-Agent Collaboration-based Compositional Diffusion for Complex Text-to-Image Generation

LookingGlass: Generative Anamorphoses via Laplacian Pyramid Warping

Paint by Inpaint: Learning to Add Image Objects by Removing Them First

Dual Semantic Guidance for Open Vocabulary Semantic Segmentation

CroCoDL: Cross-device Collaborative Dataset for Localization

PURA: Parameter Update-Recovery Test-Time Adaption for RGB-T Tracking

Glossy Object Reconstruction with Cost-effective Polarized Acquisition

Generalizable Object Keypoint Localization from Generative Priors

L-SWAG: Layer-Sample Wise Activation with Gradients Information for Zero-Shot NAS on Vision Transformers

What Makes a Good Dataset for Knowledge Distillation?

Lifelong Knowledge Editing for Vision Language Models with Low-Rank Mixture-of-Experts

PartGen: Part-level 3D Generation and Reconstruction with Multi-view Diffusion Models

Yo’Chameleon: Personalized Vision and Language Generation

Chebyshev Attention Depth Permutation Texture Network with Latent Texture Attribute Loss

FedCALM: Conflict-aware Layer-wise Mitigation for Selective Aggregation in Deeper Personalized Federated Learning

FlexGS: Train Once, Deploy Everywhere with Many-in-One Flexible 3D Gaussian Splatting

Precise, Fast, and Low-cost Concept Erasure in Value Space: Orthogonal Complement Matters

HOIGen-1M: A Large-scale Dataset for Human-Object Interaction Video Generation

T2ISafety: Benchmark for Assessing Fairness, Toxicity, and Privacy in Image Generation

Order-One Rolling Shutter Cameras

Shining Yourself: High-Fidelity Ornaments Virtual Try-on with Diffusion Model

Make It Count: Text-to-Image Generation with an Accurate Number of Objects

AnomalyNCD: Towards Novel Anomaly Class Discovery in Industrial Scenarios

Task-Specific Gradient Adaptation for Few-Shot One-Class Classification

TraF-Align: Trajectory-aware Feature Alignment for Asynchronous Multi-agent Perception

DreamCache: Finetuning-Free Lightweight Personalized Image Generation via Feature Caching

3D Gaussian Inpainting with Depth-Guided Cross-View Consistency

Synthetic Prior for Few-Shot Drivable Head Avatar Inversion

Focusing on Tracks for Online Multi-Object Tracking

Floxels: Fast Unsupervised Voxel Based Scene Flow Estimation

Identity-preserving Distillation Sampling by Fixed-Point Iterator

VladVA: Discriminative Fine-tuning of LVLMs

HumanMM: Global Human Motion Recovery from Multi-shot Videos

Removing Reflections from RAW Photos

AlignMamba: Enhancing Multimodal Mamba with Local and Global Cross-modal Alignment

VideoComp: Advancing Fine-Grained Compositional and Temporal Alignment in Video-Text Models

MICAS: Multi-grained In-Context Adaptive Sampling for 3D Point Cloud Processing

Can Text-to-Video Generation help Video-Language Alignment?

GoalFlow: Goal-Driven Flow Matching for Multimodal Trajectories Generation in End-to-End Autonomous Driving

ColabSfM: Collaborative Structure-from-Motion by Point Cloud Registration

RoadSocial: A Diverse VideoQA Dataset and Benchmark for Road Event Understanding from Social Video Narratives

Do We Always Need the Simplicity Bias? Looking for Optimal Inductive Biases in the Wild

MangaNinja: Line Art Colorization with Precise Reference Following

Nonisotropic Gaussian Diffusion for Realistic 3D Human Motion Prediction

LookCloser: Frequency-aware Radiance Field for Tiny-Detail Scene

PICO: Reconstructing 3D People In Contact with Objects

Convex Relaxation for Robust Vanishing Point Estimation in Manhattan World

TriTex: Learning Texture from a Single Mesh via Triplane Semantic Features

Linguistics-aware Masked Image Modeling for Self-supervised Scene Text Recognition

Scaling up Image Segmentation across Data and Tasks

MotionStone: Decoupled Motion Intensity Modulation with Diffusion Transformer for Image-to-Video Generation

Take the Bull by the Horns: Learning to Segment Hard Samples

Bridging Past and Future: End-to-End Autonomous Driving with Historical Prediction and Planning

EIDT-V: Exploiting Intersections in Diffusion Trajectories for Model-Agnostic, Zero-Shot, Training-Free Text-to-Video Generation

Temporal Score Analysis for Understanding and Correcting Diffusion Artifacts

Cross-View Completion Models are Zero-shot Correspondence Estimators

Multi-party Collaborative Attention Control for Image Customization

Reproducible Vision-Language Models Meet Concepts Out of Pre-Training

Through-The-Mask: Mask-based Motion Trajectories for Image-to-Video Generation

AIM-Fair: Advancing Algorithmic Fairness via Selectively Fine-Tuning Biased Models with Contextual Synthetic Data

ReWind: Understanding Long Videos with Instructed Learnable Memory

Segment Anything, Even Occluded

Advancing Multiple Instance Learning with Continual Learning for Whole Slide Imaging

ABBSPO: Adaptive Bounding Box Scaling and Symmetric Prior based Orientation Prediction for Detecting Aerial Image Objects

Decoupled Distillation to Erase: A General Unlearning Method for Any Class-centric Tasks

TAET: Two-Stage Adversarial Equalization Training on Long-Tailed Distributions

Do Your Best and Get Enough Rest for Continual Learning

MUSt3R: Multi-view Network for Stereo 3D Reconstruction

Mamba as a Bridge: Where Vision Foundation Models Meet Vision Language Models for Domain-Generalized Semantic Segmentation

Vision-Guided Action: Enhancing 3D Human Motion Prediction with Gaze-informed Affordance in 3D Scenes

ChainHOI: Joint-based Kinematic Chain Modeling for Human-Object Interaction Generation

CLOC: Contrastive Learning for Ordinal Classification with Multi-Margin N-pair Loss

Universal Actions for Enhanced Embodied Foundation Models

ObjectMover: Generative Object Movement with Video Prior

LoTUS: Large-Scale Machine Unlearning with a Taste of Uncertainty

DiffFNO: Diffusion Fourier Neural Operator

SEC-Prompt: SEmantic Complementary Prompting for Few-Shot Class-Incremental Learning

LiMoE: Mixture of LiDAR Representation Learners from Automotive Scenes

WAVE: Weight Templates for Adaptive Initialization of Variable-sized Models

KAC: Kolmogorov-Arnold Classifier for Continual Learning

PI-HMR: Towards Robust In-bed Temporal Human Shape Reconstruction with Contact Pressure Sensing

CXPMRG-Bench: Pre-training and Benchmarking for X-ray Medical Report Generation on CheXpert Plus Dataset

SEEN-DA: SEmantic ENtropy guided Domain-aware Attention for Domain Adaptive Object Detection

Mind the Trojan Horse: Image Prompt Adapter Enabling Scalable and Deceptive Jailbreaking

Lost in Translation, Found in Context: Sign Language Translation with Contextual Cues

CraftsMan3D: High-fidelity Mesh Generation with 3D Native Diffusion and Interactive Geometry Refiner

Synchronized Video-to-Audio Generation via Mel Quantization-Continuum Decomposition

Semantic-guided Cross-Modal Prompt Learning for Skeleton-based Zero-shot Action Recognition

ChatGen: Automatic Text-to-Image Generation From FreeStyle Chatting

Dora: Sampling and Benchmarking for 3D Shape Variational Auto-Encoders

VEU-Bench: Towards Comprehensive Understanding of Video Editing

PatchVSR: Breaking Video Diffusion Resolution Limits with Patch-wise Video Super-Resolution

Free360: Layered Gaussian Splatting for Unbounded 360-Degree View Synthesis from Extremely Sparse and Unposed Views

FluxSpace: Disentangled Semantic Editing in Rectified Flow Models

Touch2Shape: Touch-Conditioned 3D Diffusion for Shape Exploration and Reconstruction

ProAPO: Progressively Automatic Prompt Optimization for Visual Classification

ShapeWords: Guiding Text-to-Image Synthesis with 3D Shape-Aware Prompts

Black Swan: Abductive and Defeasible Video Reasoning in Unpredictable Events

TANGO: Training-free Embodied AI Agents for Open-world Tasks

Silence is Golden: Leveraging Adversarial Examples to Nullify Audio Control in LDM-based Talking-Head Generation

RNG: Relightable Neural Gaussians

Filter Images First, Generate Instructions Later: Pre-Instruction Data Selection for Visual Instruction Tuning

DexGrasp Anything: Towards Universal Robotic Dexterous Grasping with Physics Awareness

Vision-Language Gradient Descent-driven All-in-One Deep Unfolding Networks

FireEdit: Fine-grained Instruction-based Image Editing via Region-aware Vision Language Model

3D-LLaVA: Towards Generalist 3D LMMs with Omni Superpoint Transformer

VL2Lite: Task-Specific Knowledge Distillation from Large Vision-Language Models to Lightweight Networks

VidHalluc: Evaluating Temporal Hallucinations in Multimodal Large Language Models for Video Understanding

From Laboratory to Real World: A New Benchmark Towards Privacy-Preserved Visible-Infrared Person Re-Identification

AToM: Aligning Text-to-Motion Model at Event-Level with GPT-4Vision Reward

4Deform: Neural Surface Deformation for Robust Shape Interpolation

Dense Match Summarization for Faster Two-view Estimation

MultiGO: Towards Multi-level Geometry Learning for Monocular 3D Textured Human Reconstruction

Align-A-Video: Deterministic Reward Tuning of Image Diffusion Models for Consistent Video Editing

Interpreting Object-level Foundation Models via Visual Precision Search

CARE Transformer: Mobile-Friendly Linear Visual Transformer via Decoupled Dual Interaction

Motion-Grounded Video Reasoning: Understanding and Perceiving Motion at Pixel Level

Schedule On the Fly: Diffusion Time Prediction for Faster and Better Image Generation

PMA: Towards Parameter-Efficient Point Cloud Understanding via Point Mamba Adapter

All-directional Disparity Estimation for Real-world QPD Images

LC-Mamba: Local and Continuous Mamba with Shifted Windows for Frame Interpolation

Deterministic-to-Stochastic Diverse Latent Feature Mapping for Human Motion Synthesis

Distinguish Then Exploit: Source-free Open Set Domain Adaptation via Weight Barcode Estimation and Sparse Label Assignment

GPAvatar: High-fidelity Head Avatars by Learning Efficient Gaussian Projections

PIAD: Pose and Illumination agnostic Anomaly Detection

CAV-MAE Sync: Improving Contrastive Audio-Visual Mask Autoencoders via Fine-Grained Alignment

COBRA: COmBinatorial Retrieval Augmentation for Few-Shot Adaptation

MFogHub: Bridging Multi-Regional and Multi-Satellite Data for Global Marine Fog Detection and Forecasting

Dyn-HaMR: Recovering 4D Interacting Hand Motion from a Dynamic Camera

Descriptor-In-Pixel: Point-Feature Tracking For Pixel Processor Arrays

Dual Consolidation for Pre-Trained Model-Based Domain-Incremental Learning

EmoEdit: Evoking Emotions through Image Manipulation

PIDSR: Complementary Polarized Image Demosaicing and Super-Resolution

RORem: Training a Robust Object Remover with Human-in-the-Loop

Video-Bench: Human-Aligned Video Generation Benchmark

PMNI: Pose-free Multi-view Normal Integration for Reflective and Textureless Surface Reconstruction

MergeVQ: A Unified Framework for Visual Generation and Representation with Disentangled Token Merging and Quantization

Joint Optimization of Neural Radiance Fields and Continuous Camera Motion from a Monocular Video

IRGS: Inter-Reflective Gaussian Splatting with 2D Gaussian Ray Tracing

Meta-Learning Hyperparameters for Parameter Efficient Fine-Tuning

EgoPressure: A Dataset for Hand Pressure and Pose Estimation in Egocentric Vision

Efficient Test-time Adaptive Object Detection via Sensitivity-Guided Pruning

A Data-Centric Revisit of Pre-Trained Vision Models for Robot Learning

TAMT: Temporal-Aware Model Tuning for Cross-Domain Few-Shot Action Recognition

BioX-CPath: Biologically-driven Explainable Diagnostics for Multistain IHC Computational Pathology

DriveDreamer4D: World Models Are Effective Data Machines for 4D Driving Scene Representation

ARKit LabelMaker: A New Scale for Indoor 3D Scene Understanding

Dr. Splat: Directly Referring 3D Gaussian Splatting via Direct Language Embedding Registration

Mesh Mamba: A Unified State Space Model for Saliency Prediction in Non-Textured and Textured Meshes

Wavelet and Prototype Augmented Query-based Transformer for Pixel-level Surface Defect Detection

AIGV-Assessor: Benchmarking and Evaluating the Perceptual Quality of Text-to-Video Generation with LMM

GauCho: Gaussian Distributions with Cholesky Decomposition for Oriented Object Detection

Alignment, Mining and Fusion: Representation Alignment with Hard Negative Mining and Selective Knowledge Fusion for Medical Visual Question Answering

FineVQ: Fine-Grained User Generated Content Video Quality Assessment

Q-Bench-Video: Benchmark the Video Quality Understanding of LMMs

Mind the Gap: Detecting Black-box Adversarial Attacks in the Making through Query Update Analysis

Consistent Normal Orientation for 3D Point Clouds via Least Squares on Delaunay Graph

ICP: Immediate Compensation Pruning for Mid-to-high Sparsity

ATA: Adaptive Transformation Agent for Text-Guided Subject-Position Variable Background Inpainting

Optimizing for the Shortest Path in Denoising Diffusion Model

Dynamic Pseudo Labeling via Gradient Cutting for High-Low Entropy Exploration

VODiff: Controlling Object Visibility Order in Text-to-Image Generation

CAD-Llama: Leveraging Large Language Models for Computer-Aided Design Parametric 3D Model Generation

Seeing the Abstract: Translating the Abstract Language for Vision Language Models

Leveraging SD Map to Augment HD Map-based Trajectory Prediction

HyperPose: Hypernetwork-Infused Camera Pose Localization and an Extended Cambridge Landmarks Dataset

4DGC: Rate-Aware 4D Gaussian Compression for Efficient Streamable Free-Viewpoint Video

Apply Hierarchical-Chain-of-Generation to Complex Attributes Text-to-3D Generation

MAD: Memory-Augmented Detection of 3D Objects

PersonaHOI: Effortlessly Improving Face Personalization in Human-Object Interaction Generation

Distilled Prompt Learning for Incomplete Multimodal Survival Prediction

Video-Panda: Parameter-efficient Alignment for Encoder-free Video-Language Models

MonoSplat: Generalizable 3D Gaussian Splatting from Monocular Depth Foundation Models

SOLVE: Synergy of Language-Vision and End-to-End Networks for Autonomous Driving

Shift the Lens: Environment-Aware Unsupervised Camouflaged Object Detection

Probability Density Geodesics in Image Diffusion Latent Space

High-quality Point Cloud Oriented Normal Estimation via Hybrid Angular and Euclidean Distance Encoding

DriveScape: High-Resolution Driving Video Generation by Multi-View Feature Fusion

Advancing Manga Analysis: Comprehensive Segmentation Annotations for the Manga109 Dataset

Training-free Neural Architecture Search through Variance of Knowledge of Deep Network Weights

Every SAM Drop Counts: Embracing Semantic Priors for Multi-Modality Image Fusion and Beyond

EgoLife: Towards Egocentric Life Assistant

NTR-Gaussian: Nighttime Dynamic Thermal Reconstruction with 4D Gaussian Splatting Based on Thermodynamics

RAEncoder: A Label-Free Reversible Adversarial Examples Encoder for Dataset Intellectual Property Protection

BrepGiff: Lightweight Generation of Complex B-rep with 3D GAT Diffusion

Towards Fine-Grained Interpretability: Counterfactual Explanations for Misclassification with Saliency Partition

Progressive Correspondence Regenerator for Robust 3D Registration

Reference-Based 3D-Aware Image Editing with Triplanes

Decompositional Neural Scene Reconstruction with Generative Diffusion Prior

Distribution Prototype Diffusion Learning for Open-set Supervised Anomaly Detection

Unlocking Generalization Power in LiDAR Point Cloud Registration

Learning Visual Generative Priors without Text

3DGUT: Enabling Distorted Cameras and Secondary Rays in Gaussian Splatting

Not Just Text: Uncovering Vision Modality Typographic Threats in Image Generation Models

EgoTextVQA: Towards Egocentric Scene-Text Aware Video Question Answering

Teaching Large Language Models to Regress Accurate Image Quality Scores Using Score Distribution

v-CLR: View-Consistent Learning for Open-World Instance Segmentation

V^2Dial: Unification of Video and Visual Dialog via Multimodal Experts

Lifting the Veil on Visual Information Flow in MLLMs: Unlocking Pathways to Faster Inference

Speedy-Splat: Fast 3D Gaussian Splatting with Sparse Pixels and Sparse Primitives

Mamba-Reg: Vision Mamba Also Needs Registers

OmniMMI: A Comprehensive Multi-modal Interaction Benchmark in Streaming Video Contexts

It’s a (Blind) Match! Towards Vision-Language Correspondence without Parallel Data

AirRoom: Objects Matter in Room Reidentification

Pixel-level and Semantic-level Adjustable Super-resolution: A Dual-LoRA Approach

SuperPC: A Single Diffusion Model for Point Cloud Completion, Upsampling, Denoising, and Colorization

Let Samples Speak: Mitigating Spurious Correlation by Exploiting the Clusterness of Samples

Interpretable Image Classification via Non-parametric Part Prototype Learning

Mr. DETR: Instructive Multi-Route Training for Detection Transformers

Not Only Text: Exploring Compositionality of Visual Representations in Vision-Language Models

Prototype-Based Image Prompting for Weakly Supervised Histopathological Image Segmentation

HOP: Heterogeneous Topology-based Multimodal Entanglement for Co-Speech Gesture Generation

Less Attention is More: Prompt Transformer for Generalized Category Discovery

Minority-Focused Text-to-Image Generation via Prompt Optimization

Imputation-free and Alignment-free: Incomplete Multi-view Clustering Driven by Consensus Semantic Learning

Sensitivity-Aware Efficient Fine-Tuning via Compact Dynamic-Rank Adaptation

MoDec-GS: Global-to-Local Motion Decomposition and Temporal Interval Adjustment for Compact Dynamic 3D Gaussian Splatting

CADRef: Robust Out-of-Distribution Detection via Class-Aware Decoupled Relative Feature Leveraging

DaCapo: Score Distillation as Stacked Bridge for Fast and High-quality 3D Editing

TSAM: Temporal SAM Augmented with Multimodal Prompts for Referring Audio-Visual Segmentation

A Selective Re-learning Mechanism for Hyperspectral Fusion Imaging

Fish-Vista: A Multi-Purpose Dataset for Understanding & Identification of Traits from Images

Autoregressive Sequential Pretraining for Visual Tracking

Number it: Temporal Grounding Videos like Flipping Manga

SyncVP: Joint Diffusion for Synchronous Multi-Modal Video Prediction

HUSH: Holistic Panoramic 3D Scene Understanding using Spherical Harmonics

SkillMimic: Learning Basketball Interaction Skills from Demonstrations

VISTREAM: Improving Computation Efficiency of Visual Streaming Perception via Law-of-Charge-Conservation Inspired Spiking Neural Network

STPro: Spatial and Temporal Progressive Learning for Weakly Supervised Spatio-Temporal Grounding

SALOVA: Segment-Augmented Long Video Assistant for Targeted Retrieval and Routing in Long-Form Video Analysis

RGBAvatar: Reduced Gaussian Blendshapes for Online Modeling of Head Avatars

EEE-Bench: A Comprehensive Multimodal Electrical And Electronics Engineering Benchmark

Text-Driven Fashion Image Editing with Compositional Concept Learning and Counterfactual Abduction

Good, Cheap, and Fast: Overfitted Image Compression with Wasserstein Distortion

VLOGGER: Multimodal Diffusion for Embodied Avatar Synthesis

AuraFusion360: Augmented Unseen Region Alignment for Reference-based 360° Unbounded Scene Inpainting

SoMA: Singular Value Decomposed Minor Components Adaptation for Domain Generalizable Representation Learning

Composing Parts for Expressive Object Generation

DynaMoDe-NeRF: Motion-aware Deblurring Neural Radiance Field for Dynamic Scenes

Adapting Text-to-Image Generation with Feature Difference Instruction for Generic Image Restoration

Tokenize Image Patches: Global Context Fusion for Effective Haze Removal in Large Images

UNOPose: Unseen Object Pose Estimation with an Unposed RGB-D Reference Image

Tuning the Frequencies: Robust Training for Sinusoidal Neural Networks

Real-time Free-view Human Rendering from Sparse-view RGB Videos using Double Unprojected Textures

ECBench: Can Multi-modal Foundation Models Understand the Egocentric World? A Holistic Embodied Cognition Benchmark

Uncertainty Meets Diversity: A Comprehensive Active Learning Framework for Indoor 3D Object Detection

CASAGPT: Cuboid Arrangement and Scene Assembly for Interior Design

GIVEPose: Gradual Intra-class Variation Elimination for RGB-based Category-Level Object Pose Estimation

Identity-Clothing Similarity Modeling for Unsupervised Clothing Change Person Re-Identification

MINIMA: Modality Invariant Image Matching

3D Prior Is All You Need: Cross-Task Few-shot 2D Gaze Estimation

Spotting the Unexpected (STU): A 3D LiDAR Dataset for Anomaly Segmentation in Autonomous Driving

Is 'Right' Right? Enhancing Object Orientation Understanding in Multimodal Large Language Models through Egocentric Instruction Tuning

GCC: Generative Color Constancy via Diffusing a Color Checker

Blind Bitstream-corrupted Video Recovery via Metadata-guided Diffusion Model

On Denoising Walking Videos for Gait Recognition

Conformal Prediction for Zero-Shot Models

PhysAnimator: Physics-Guided Generative Cartoon Animation

SeriesBench: A Benchmark for Narrative-Driven Drama Series Understanding

Weakly Supervised Temporal Action Localization via Dual-Prior Collaborative Learning Guided by Multimodal Large Language Models

HotSpot: Signed Distance Function Optimization with an Asymptotically Sufficient Condition

BACON: Improving Clarity of Image Captions via Bag-of-Concept Graphs

EntitySAM: Segment Everything in Video

Scene-agnostic Pose Regression for Visual Localization

GS-2DGS: Geometrically Supervised 2DGS for Reflective Object Reconstruction

Libra-Merging: Importance-redundancy and Pruning-merging Trade-off for Acceleration Plug-in in Large Vision-Language Model

Point2RBox-v2: Rethinking Point-supervised Oriented Object Detection with Spatial Layout Among Instances

VasTSD: Learning 3D Vascular Tree-state Space Diffusion Model for Angiography Synthesis

GenFusion: Closing the Loop between Reconstruction and Generation via Videos

PanSplat: 4K Panorama Synthesis with Feed-Forward Gaussian Splatting

Towards Realistic Example-based Modeling via 3D Gaussian Stitching

Saliuitl: Ensemble Salience Guided Recovery of Adversarial Patches against CNNs

WISNet: Pseudo Label Generation on Unbalanced and Patch Annotated Waste Images

RaSS: Improving Denoising Diffusion Samplers with Reinforced Active Sampling Scheduler

MixerMDM: Learnable Composition of Human Motion Diffusion Models

LEDiff: Latent Exposure Diffusion for HDR Generation

Video Depth Anything: Consistent Depth Estimation for Super-Long Videos

RaCFormer: Towards High-Quality 3D Object Detection via Query-based Radar-Camera Fusion

InstanceCap: Improving Text-to-Video Generation via Instance-aware Structured Caption

AudCast: Audio-Driven Human Video Generation by Cascaded Diffusion Transformers

Luminance-GS: Adapting 3D Gaussian Splatting to Challenging Lighting Conditions with View-Adaptive Curve Adjustment

Pathways on the Image Manifold: Image Editing via Video Generation

EventSplat: 3D Gaussian Splatting from Moving Event Cameras for Real-time Rendering

3D Student Splatting and Scooping

LOGICZSL: Exploring Logic-induced Representation for Compositional Zero-shot Learning

A Stitch in Time Saves Nine: Small VLM is a Precise Guidance for Accelerating Large VLMs

Unveiling Visual Perception in Language Models: An Attention Head Analysis Approach

SemanticDraw: Towards Real-Time Interactive Content Creation from Image Diffusion Models

Learning Partonomic 3D Reconstruction from Image Collections

EVOS: Efficient Implicit Neural Training via EVOlutionary Selector

CoMBO: Conflict Mitigation via Branched Optimization for Class Incremental Segmentation

Crab: A Unified Audio-Visual Scene Understanding Model with Explicit Cooperation

Nullu: Mitigating Object Hallucinations in Large Vision-Language Models via HalluSpace Projection

Cross-Modal Interactive Perception Network with Mamba for Lung Tumor Segmentation in PET-CT Images

No Thing, Nothing: Highlighting Safety-Critical Classes for Robust LiDAR Semantic Segmentation in Adverse Weather

MEET: Towards Memory-Efficient Temporal Sparse Deep Neural Networks

Probabilistic Prompt Distribution Learning for Animal Pose Estimation

Reconstructing Humans with a Biomechanically Accurate Skeleton

AdaCM^2: On Understanding Extremely Long-Term Video with Adaptive Cross-Modality Memory Reduction

VGGT: Visual Geometry Grounded Transformer

Show and Segment: Universal Medical Image Segmentation via In-Context Learning

LUCAS: Layered Universal Codec Avatars

Enhancing Facial Privacy Protection via Weakening Diffusion Purification

3DEnhancer: Consistent Multi-View Diffusion for 3D Enhancement

SeedVR: Seeding Infinity in Diffusion Transformer Towards Generic Video Restoration

Silent Branding Attack: Trigger-free Data Poisoning Attack on Text-to-Image Diffusion Models

Mani-GS: Gaussian Splatting Manipulation with Triangular Mesh

UniHOPE: A Unified Approach for Hand-Only and Hand-Object Pose Estimation

RL-RC-DoT: A Block-level RL agent for Task-Aware Video Compression

MEAT: Multiview Diffusion Model for Human Generation on Megapixels with Mesh Attention

PHGC: Procedural Heterogeneous Graph Completion for Natural Language Task Verification in Egocentric Videos

Recognition-Synergistic Scene Text Editing

DoF-Gaussian: Controllable Depth-of-Field for 3D Gaussian Splatting

Towards Consistent Multi-Task Learning: Unlocking the Potential of Task-Specific Parameters

Rectified Diffusion Guidance for Conditional Generation

EdgeTAM: On-Device Track Anything Model

IAAO: Interactive Affordance Learning for Articulated Objects in 3D Environments

Towards In-the-wild 3D Plane Reconstruction from a Single Image

FloVD: Optical Flow Meets Video Diffusion Model for Enhanced Camera-Controlled Video Synthesis

RAD: Region-Aware Diffusion Models for Image Inpainting

Supervising Sound Localization by In-the-wild Egomotion

AutoLUT: LUT-Based Image Super-Resolution with Automatic Sampling and Adaptive Residual Learning

Understanding Fine-tuning CLIP for Open-vocabulary Semantic Segmentation in Hyperbolic Space

TexGaussian: Generating High-quality PBR Material via Octree-based 3D Gaussian Splatting

OSV: One Step is Enough for High-Quality Image to Video Generation

BlobGEN-Vid: Compositional Text-to-Video Generation with Blob Video Representations

Scaling Vision Pre-Training to 4K Resolution

Fuzzy Multimodal Learning for Trusted Cross-modal Retrieval

NVILA: Efficient Frontier Visual Language Models

AniGS: Animatable Gaussian Avatar from a Single Image with Inconsistent Gaussian Reconstruction

GUI-Xplore: Empowering Generalizable GUI Agents with One Exploration

Few-Shot Recognition via Stage-Wise Retrieval-Augmented Finetuning

Concept Replacer: Replacing Sensitive Concepts in Diffusion Models via Precision Localization

Deep Fair Multi-View Clustering with Attention KAN

SparseAlign: A Fully Sparse Framework for Cooperative Object Detection

LineArt: A Knowledge-guided Training-free High-quality Appearance Transfer for Design Drawing with Diffusion Model

4Real-Video: Learning Generalizable Photo-Realistic 4D Video Diffusion

Scene Map-based Prompt Tuning for Navigation Instruction Generation

F-LMM: Grounding Frozen Large Multimodal Models

Zero-Shot Image Restoration Using Few-Step Guidance of Consistency Models (and Beyond)

LidarGait++: Learning Local Features and Size Awareness from LiDAR Point Clouds for 3D Gait Recognition

UrbanCAD: Towards Highly Controllable and Photorealistic 3D Vehicles for Urban Scene Simulation

Diff-Palm: Realistic Palmprint Generation with Polynomial Creases and Intra-Class Variation Controllable Diffusion Models

UniNet: A Contrastive Learning-guided Unified Framework with Feature Selection for Anomaly Detection

Chain of Semantics Programming in 3D Gaussian Splatting Representation for 3D Vision Grounding

On the Zero-shot Adversarial Robustness of Vision-Language Models: A Truly Zero-shot and Training-free Approach

Towards General Visual-Linguistic Face Forgery Detection

Movie Weaver: Tuning-Free Multi-Concept Video Personalization with Anchored Prompts

MVPortrait: Text-Guided Motion and Emotion Control for Multi-view Vivid Portrait Animation

MoEdit: On Learning Quantity Perception for Multi-object Image Editing

Seeing More with Less: Human-like Representations in Vision Models

FALCON: Fairness Learning via Contrastive Attention Approach to Continual Semantic Scene Understanding

Modeling Thousands of Human Annotators for Generalizable Text-to-Image Person Re-identification

Accelerating Multimodal Large Language Models by Searching Optimal Vision Token Reduction

Matrix-Free Shared Intrinsics Bundle Adjustment

Thinking in Space: How Multimodal Large Language Models See, Remember, and Recall Spaces

Nested Diffusion Models Using Hierarchical Latent Priors

Tra-MoE: Learning Trajectory Prediction Model from Multiple Domains for Adaptive Policy Conditioning

Localizing Events in Videos with Multimodal Queries

FlashGS: Efficient 3D Gaussian Splatting for Large-scale and High-resolution Rendering

CleanDIFT: Diffusion Features without Noise

Uncertainty-Instructed Structure Injection for Generalizable HD Map Construction

Not All Parameters Matter: Masking Diffusion Models for Enhancing Generation Ability

Doppelgängers and Adversarial Vulnerability

Don't Shake the Wheel: Momentum-Aware Planning in End-to-End Autonomous Driving

Complexity Experts are Task-Discriminative Learners for Any Image Restoration

Generative Omnimatte: Learning to Decompose Video into Layers

5%>100%: Breaking Performance Shackles of Full Fine-Tuning on Visual Recognition Tasks

Precise Event Spotting in Sports Videos: Solving Long-Range Dependency and Class Imbalance

Steady Progress Beats Stagnation: Mutual Aid of Foundation and Conventional Models in Mixed Domain Semi-Supervised Medical Image Segmentation

ATP: Adaptive Threshold Pruning for Efficient Data Encoding in Quantum Neural Networks

dFLMoE: Decentralized Federated Learning via Mixture of Experts for Medical Data Analysis

AFL: A Single-Round Analytic Approach for Federated Learning with Pre-trained Models

MIMO: A Medical Vision Language Model with Visual Referring Multimodal Input and Pixel Grounding Multimodal Output

XLRS-Bench: Could Your Multimodal LLMs Understand Extremely Large Ultra-High-Resolution Remote Sensing Imagery?

BOLT: Boost Large Vision-Language Model Without Training for Long-form Video Understanding

Efficient Data Driven Mixture-of-Expert Extraction from Trained Networks

Towards All-in-One Medical Image Re-Identification

SegAgent: Exploring Pixel Understanding Capabilities in MLLMs by Imitating Human Annotator Trajectories

SceneCrafter: Controllable Multi-View Driving Scene Editing

AMO Sampler: Enhancing Text Rendering with Overshooting

I2VGuard: Safeguarding Images against Misuse in Diffusion-based Image-to-Video Models

Rashomon Sets for Prototypical-Part Networks: Editing Interpretable Models in Real-Time

HeatFormer: A Neural Optimizer for Multiview Human Mesh Recovery

ResCLIP: Residual Attention for Training-free Dense Vision-language Inference

GPS as a Control Signal for Image Generation

MAGiC-SLAM: Multi-Agent Gaussian Globally Consistent SLAM

MVGenMaster: Scaling Multi-View Generation from Any Image via 3D Priors Enhanced Diffusion Model

Bayesian Test-Time Adaptation for Vision-Language Models

Causal Composition Diffusion Model for Closed-loop Traffic Generation

Change3D: Revisiting Change Detection and Captioning from A Video Modeling Perspective

Hybrid Concept Bottleneck Models

DyMO: Training-Free Diffusion Model Alignment with Dynamic Multi-Objective Scheduling

HiMoR: Monocular Deformable Gaussian Reconstruction with Hierarchical Motion Representation

GENIUS: A Generative Framework for Universal Multimodal Search

Enhanced Visual-Semantic Interaction with Tailored Prompts for Pedestrian Attribute Recognition

SF3D: Stable Fast 3D Mesh Reconstruction with UV-unwrapping and Illumination Disentanglement

Towards Precise Embodied Dialogue Localization via Causality Guided Diffusion

Customized Condition Controllable Generation for Video Soundtrack

EfficientViM: Efficient Vision Mamba with Hidden State Mixer based State Space Duality

A4A: Adapter for Adapter Transfer via All-for-All Mapping for Cross-Architecture Models

ViCaS: A Dataset for Combining Holistic and Pixel-level Video Understanding using Captions with Grounded Segmentation

A Universal Scale-Adaptive Deformable Transformer for Image Restoration across Diverse Artifacts

MicroVQA: A Multimodal Reasoning Benchmark for Microscopy-Based Scientific Research

WISE: A Framework for Gigapixel Whole-Slide-Image Lossless Compression

Gromov–Wasserstein Problem with Cyclic Symmetry

IRIS: Inverse Rendering of Indoor Scenes from Low Dynamic Range Images

RoGSplat: Learning Robust Generalizable Human Gaussian Splatting from Sparse Multi-View Images

SDBF: Steep-Decision-Boundary Fingerprinting for Hard-Label Tampering Detection of DNN Models

EnliveningGS: Active Locomotion of 3DGS

SPMTrack: Spatio-Temporal Parameter-Efficient Fine-Tuning with Mixture of Experts for Scalable Visual Tracking

AnyCam: Learning to Recover Camera Poses and Intrinsics from Casual Videos

Knowledge-Aligned Counterfactual-Enhancement Diffusion Perception for Unsupervised Cross-Domain Visual Emotion Recognition

Distilling Multi-modal Large Language Models for Autonomous Driving

Pixel-aligned RGB-NIR Stereo Imaging and Dataset for Robot Vision

Can Machines Understand Composition? Dataset and Benchmark for Photographic Image Composition Embedding and Understanding

LPOSS: Label Propagation Over Patches and Pixels for Open-vocabulary Semantic Segmentation

PhysGen3D: Crafting a Miniature Interactive World from a Single Image

Scaling Properties of Diffusion Models For Perceptual Tasks

HD-EPIC: A Highly-Detailed Egocentric Video Dataset

Exact: Exploring Space-Time Perceptive Clues for Weakly Supervised Satellite Image Time Series Semantic Segmentation

PolarFree: Polarization-based Reflection-Free Imaging

H-MoRe: Learning Human-centric Motion Representation for Action Analysis

Hierarchical Compact Clustering Attention (COCA) for Unsupervised Object-Centric Learning

Effortless Active Labeling for Long-Term Test-Time Adaptation

LATTE-MV: Learning to Anticipate Table Tennis Hits from Monocular Videos

Pay Attention to the Foreground in Object-Centric Learning

MultimodalStudio: A Heterogeneous Sensor Dataset and Framework for Neural Rendering across Multiple Imaging Modalities

Dense-SfM: Structure from Motion with Dense Consistent Matching

FluidNexus: 3D Fluid Reconstruction and Prediction from a Single Video

MuTri: Multi-view Tri-alignment for OCT to OCTA 3D Image Translation

Image Quality Assessment: Investigating Causal Perceptual Effects with Abductive Counterfactual Inference

OCRT: Boosting Foundation Models in the Open World with Object-Concept-Relation Triad

SPARS3R: Semantic Prior Alignment and Regularization for Sparse 3D Reconstruction

VidBot: Learning Generalizable 3D Actions from In-the-Wild 2D Human Videos for Zero-Shot Robotic Manipulation

Learning Person-Specific Animatable Face Models from In-the-Wild Images via a Shared Base Model

TIMotion: Temporal and Interactive Framework for Efficient Human-Human Motion Generation

Which Viewpoint Shows it Best? Language for Weakly Supervising View Selection in Multi-view Instructional Videos

Hybrid Reciprocal Transformer with Triplet Feature Alignment for Scene Graph Generation

TransPixeler: Advancing Text-to-Video Generation with Transparency

Adaptive Keyframe Sampling for Long Video Understanding

Person De-reidentification: A Variation-guided Identity Shift Modeling

FreeSim: Toward Free-viewpoint Camera Simulation in Driving Scenes

Gradient Inversion Attacks on Parameter-Efficient Fine-Tuning

DiGIT: Multi-Dilated Gated Encoder and Central-Adjacent Region Integrated Decoder for Temporal Action Detection Transformer

VideoDPO: Omni-Preference Alignment for Video Diffusion Generation

Realistic Test-Time Adaptation of Vision-Language Models

SelfSplat: Pose-Free and 3D Prior-Free Generalizable 3D Gaussian Splatting

RDD: Robust Feature Detector and Descriptor using Deformable Transformer

Enhancing Virtual Try-On with Synthetic Pairs and Error-Aware Noise Scheduling

GRAPHGPT-O: Synergistic Multimodal Comprehension and Generation on Graphs

Erasing Undesirable Influence in Diffusion Models

LT3SD: Latent Trees for 3D Scene Diffusion

Stacking Brick by Brick: Aligned Feature Isolation for Incremental Face Forgery Detection

Closest Neighbors are Harmful for Lightweight Masked Auto-encoders

Decouple-Then-Merge: Finetune Diffusion Models as Multi-Task Learning

HELVIPAD: A Real-World Dataset for Omnidirectional Stereo Depth Estimation

GENMANIP: LLM-driven Simulation for Generalizable Instruction-Following Manipulation

SKDream: Controllable Multi-view and 3D Generation with Arbitrary Skeletons

Towards Enhanced Image Inpainting: Mitigating Unwanted Object Insertion and Preserving Color Consistency

Optimus-2: Multimodal Minecraft Agent with Goal-Observation-Action Conditioned Policy

GEM: A Generalizable Ego-Vision Multimodal World Model for Fine-Grained Ego-Motion, Object Dynamics, and Scene Composition Control

LatentHOI: On the Generalizable Hand Object Motion Generation with Latent Hand Diffusion

Practical Solutions to the Relative Pose of Three Calibrated Cameras

Population Normalization for Federated Learning

AnimateAnything: Consistent and Controllable Animation for Video Generation

GenAssets: Generating in-the-wild 3D Assets in Latent Space

SemiDAViL: Semi-supervised Domain Adaptation with Vision-Language Guidance for Semantic Segmentation

RipVIS: Rip Currents Video Instance Segmentation Benchmark for Beach Monitoring and Safety

Low-Rank Adaptation in Multilinear Operator Networks for Security-Preserving Incremental Learning

Camera Resection from Known Line Pencils and a Radially Distorted Scanline

ESCAPE: Equivariant Shape Completion via Anchor Point Encoding

SPC-GS: Gaussian Splatting with Semantic-Prompt Consistency for Indoor Open-World Free-view Synthesis from Sparse Inputs

Generalized Zero-Shot Classification via Semantics-Free Inter-Class Feature Generation

M3amba: Memory Mamba is All You Need for Whole Slide Image Classification

Variance-Based Membership Inference Attacks Against Large-Scale Image Captioning Models

Redefining in Dictionary: Towards an Enhanced Semantic Understanding of Creative Generation

Temporal Alignment-Free Video Matching for Few-shot Action Recognition

OSLoPrompt: Bridging Low-Supervision Challenges and Open-Set Domain Generalization in CLIP

VLog: Video-Language Models by Generative Retrieval of Narration Vocabulary

MirrorVerse: Pushing Diffusion Models to Realistically Reflect the World

EAP-GS: Efficient Augmentation of Pointcloud for 3D Gaussian Splatting in Few-shot Scene Reconstruction

Empowering Large Language Models with 3D Situation Awareness

EchoTraffic: Enhancing Traffic Anomaly Understanding with Audio-Visual Insights

FlexDrive: Toward Trajectory Flexibility in Driving Scene Gaussian Splatting Reconstruction and Rendering

Taming Video Diffusion Prior with Scene-Grounding Guidance for 3D Gaussian Splatting from Sparse Inputs

Interactive Medical Image Segmentation: A Benchmark Dataset and Baseline

CoT-VLA: Visual Chain-of-Thought Reasoning for Vision-Language-Action Models

GigaHands: A Massive Annotated Dataset of Bimanual Hand Activities

AutoSSVH: Exploring Automated Frame Sampling for Efficient Self-Supervised Video Hashing

Augmenting Multimodal LLMs with Self-Reflective Tokens for Knowledge-based Visual Question Answering

DifIISR: A Diffusion Model with Gradient Guidance for Infrared Image Super-Resolution

FrugalNeRF: Fast Convergence for Extreme Few-shot Novel View Synthesis without Learned Priors

3D-GSW: 3D Gaussian Splatting for Robust Watermarking

Sketch Down the FLOPs: Towards Efficient Networks for Human Sketch

Pioneering 4-Bit FP Quantization for Diffusion Models: Mixup-Sign Quantization and Timestep-Aware Fine-Tuning

OpenING: A Comprehensive Benchmark for Judging Open-ended Interleaved Image-Text Generation

Dual Exposure Stereo for Extended Dynamic Range 3D Imaging

A New Statistical Model of Star Speckles for Learning to Detect and Characterize Exoplanets in Direct Imaging Observations

CompGS: Unleashing 2D Compositionality for Compositional Text-to-3D via Dynamically Optimizing 3D Gaussians

Learning Temporally Consistent Video Depth from Video Diffusion Priors

FIRE: Robust Detection of Diffusion-Generated Images via Frequency-Guided Reconstruction Error

Assessing and Learning Alignment of Unimodal Vision and Language Models

Samba: A Unified Mamba-based Framework for General Salient Object Detection

PAVE: Patching and Adapting Video Large Language Models

LesionLocator: Zero-Shot Universal Tumor Segmentation and Tracking in 3D Whole-Body Imaging

Generative Map Priors for Collaborative BEV Semantic Segmentation

Coherent 3D Portrait Video Reconstruction via Triplane Fusion

Generative Image Layer Decomposition with Visual Effects

AR-Diffusion: Asynchronous Video Generation with Auto-Regressive Diffusion

ManiVideo: Generating Hand-Object Manipulation Video with Dexterous and Generalizable Grasping

FlexUOD: The Answer to Real-world Unsupervised Image Outlier Detection

DOF-GS: Adjustable Depth-of-Field 3D Gaussian Splatting for Post-Capture Refocusing, Defocus Rendering and Blur Removal

The Photographer's Eye: Teaching Multimodal Large Language Models to See, and Critique Like Photographers

Synergizing Motion and Appearance: Multi-Scale Compensatory Codebooks for Talking Head Video Generation

HOIGPT: Learning Long-Sequence Hand-Object Interaction with Language Models

GraphI2P: Image-to-Point Cloud Registration with Exploring Pattern of Correspondence via Graph Learning

FedCS: Coreset Selection for Federated Learning

DPC: Dual-Prompt Collaboration for Tuning Vision-Language Models

Dual-Granularity Semantic Guided Sparse Routing Diffusion Model for General Pansharpening

Robust Multi-Object 4D Generation for In-the-wild Videos

SOAP: Vision-Centric 3D Semantic Scene Completion with Scene-Adaptive Decoder and Occluded Region-Aware View Projection

Taxonomy-Aware Evaluation of Vision-Language Models

Active Event-based Stereo Vision

SimVS: Simulating World Inconsistencies for Robust View Synthesis

FLAVC: Learned Video Compression with Feature Level Attention

SMILE: Infusing Spatial and Motion Semantics in Masked Video Learning

Video Language Model Pretraining with Spatio-temporal Masking

COSMOS: Cross-Modality Self-Distillation for Vision Language Pre-training

Lifting Motion to the 3D World via 2D Diffusion

TAPT: Test-Time Adversarial Prompt Tuning for Robust Inference in Vision-Language Models

Your ViT is Secretly an Image Segmentation Model

Cross-Rejective Open-Set SAR Image Registration

SplineGS: Robust Motion-Adaptive Spline for Real-Time Dynamic 3D Gaussians from Monocular Video

SCSA: A Plug-and-Play Semantic Continuous-Sparse Attention for Arbitrary Semantic Style Transfer

VIRES: Video Instance Repainting via Sketch and Text Guided Generation

Timestep Embedding Tells: It's Time to Cache for Video Diffusion Model

Can't Slow Me Down: Learning Robust and Hardware-Adaptive Object Detectors against Latency Attacks for Edge Devices

SAM2Object: Consolidating View Consistency via SAM2 for Zero-Shot 3D Instance Segmentation

CDI: Copyrighted Data Identification in Diffusion Models

Binarized Neural Network for Multi-spectral Image Fusion

CRISP: Object Pose and Shape Estimation with Test-Time Adaptation

GaussianIP: Identity-Preserving Realistic 3D Human Generation via Human-Centric Diffusion Prior

Creating Your Editable 3D Photorealistic Avatar with Tetrahedron-constrained Gaussian Splatting

Object-Shot Enhanced Grounding Network for Egocentric Video

MANTA: Diffusion Mamba for Efficient and Effective Stochastic Long-Term Dense Action Anticipation

METASCENES: Towards Automated Replica Creation for Real-world 3D Scans

Robust Multimodal Survival Prediction with Conditional Latent Differentiation Variational AutoEncoder

Sim-to-Real Causal Transfer: A Metric Learning Approach to Causally-Aware Interaction Representations

Zero-Shot Blind-spot Image Denoising via Implicit Neural Sampling

Nearly Zero-Cost Protection Against Mimicry by Personalized Diffusion Models

The Devil is in Temporal Token: High Quality Video Reasoning Segmentation

PerLA: Perceptive 3D Language Assistant

LITA-GS: Illumination-Agnostic Novel View Synthesis via Reference-Free 3D Gaussian Splatting and Physical Priors

PhyT2V: LLM-Guided Iterative Self-Refinement for Physics-Grounded Text-to-Video Generation

Track4Gen: Teaching Video Diffusion Models to Track Points Improves Video Generation

Mask^2DiT: Dual Mask-based Diffusion Transformer for Multi-Scene Long Video Generation

JamMa: Ultra-lightweight Local Feature Matching with Joint Mamba

DyCoke: Dynamic Compression of Tokens for Fast Video Large Language Models

T-FAKE: Synthesizing Thermal Images for Facial Landmarking

MammAlps: A Multi-view Video Behavior Monitoring Dataset of Wild Mammals in the Swiss Alps

Diffusion-based Realistic Listening Head Generation via Hybrid Motion Modeling

SAT-HMR: Real-Time Multi-Person 3D Mesh Estimation via Scale-Adaptive Tokens

PICD: Versatile Perceptual Image Compression with Diffusion Rendering

Wonderland: Navigating 3D Scenes from a Single Image

Learning from Streaming Video with Orthogonal Gradients

SuperLightNet: Lightweight Parameter Aggregation Network for Multimodal Brain Tumor Segmentation

Anyattack: Towards Large-scale Self-supervised Adversarial Attacks on Vision-language Models

Classifier-to-Bias: Toward Unsupervised Automatic Bias Detection for Visual Classifiers

Self-Supervised Spatial Correspondence Across Modalities

MOS-Attack: A Scalable Multi-objective Adversarial Attack Framework

Motion Modes: What Could Happen Next?

Finer-CAM: Spotting the Difference Reveals Finer Details for Visual Explanation

Black-Box Forgery Attacks on Semantic Watermarks for Diffusion Models

VidSeg: Training-free Video Semantic Segmentation based on Diffusion Models

Weakly Supervised Semantic Segmentation via Progressive Confidence Region Expansion

RELOCATE: A Simple Training-Free Baseline for Visual Query Localization Using Region-Based Representations

HOTFormerLoc: Hierarchical Octree Transformer for Versatile Lidar Place Recognition Across Ground and Aerial Views

UniK3D: Universal Camera Monocular 3D Estimation

ConMo: Controllable Motion Disentanglement and Recomposition for Zero-Shot Motion Transfer

VideoMage: Multi-Subject and Motion Customization of Text-to-Video Diffusion Models

AG-VPReID: A Challenging Large-Scale Benchmark for Aerial-Ground Video-based Person Re-Identification

EBS-EKF: Accurate and High Frequency Event-based Star Tracking

PersonaBooth: Personalized Text-to-Motion Generation

Benchmarking Object Detectors under Real-World Distribution Shifts in Satellite Imagery

SAIST: Segment Any Infrared Small Target Model Guided by Contrastive Language-Image Pretraining

Closed-Loop Supervised Fine-Tuning of Tokenized Traffic Models

DIFIX3D+: Improving 3D Reconstructions with Single-Step Diffusion Models

Time of the Flight of the Gaussians: Optimizing Depth Indirectly in Dynamic Radiance Fields

Align3R: Aligned Monocular Depth Estimation for Dynamic Videos

Compositional Caching for Training-free Open-vocabulary Attribute Detection

Seek Common Ground While Reserving Differences: Semi-Supervised Image-Text Sentiment Recognition

CoLLM: A Large Language Model for Composed Image Retrieval

Anomize: Better Open Vocabulary Video Anomaly Detection

Efficient Diffusion as Low Light Enhancer

VI^3NR: Variance Informed Initialization for Implicit Neural Representations

M-LLM Based Video Frame Selection for Efficient Video Understanding

Search and Detect: Training-Free Long Tail Object Detection via Web-Image Retrieval

EgoLM: Multi-Modal Language Model of Egocentric Motions

Ges3ViG: Incorporating Pointing Gestures into Language-Based 3D Visual Grounding for Embodied Reference Understanding

Unleashing the Potential of Multi-modal Foundation Models and Video Diffusion for 4D Dynamic Physical Scene Simulation

Long Video Diffusion Generation with Segmented Cross-Attention and Content-Rich Video Data Curation

Learning Heterogeneous Tissues with Mixture of Experts for Gigapixel Whole Slide Images

HyperNVD: Accelerating Neural Video Decomposition via Hypernetworks

UnCommon Objects in 3D

Disentangled Pose and Appearance Guidance for Multi-Pose Generation

Mind the Gap: Confidence Discrepancy Can Guide Federated Semi-Supervised Learning Across Pseudo-Mismatch

Instant Adversarial Purification with Adversarial Consistency Distillation

Electromyography-Informed Facial Expression Reconstruction for Physiological-Based Synthesis and Analysis

Mask-Adapter: The Devil is in the Masks for Open-Vocabulary Segmentation

MPDrive: Improving Spatial Understanding with Marker-Based Prompt Learning for Autonomous Driving

Fine-Grained Image-Text Correspondence with Cost Aggregation for Open-Vocabulary Part Segmentation

NitroFusion: High-Fidelity Single-Step Diffusion through Dynamic Adversarial Training

ByTheWay: Boost Your Text-to-Video Generation Model to Higher Quality in a Training-free Way

CMMLoc: Advancing Text-to-PointCloud Localization with Cauchy-Mixture-Model Based Framework

Masked Point-Entity Contrast for Open-Vocabulary 3D Scene Understanding

Decoupling Training-Free Guided Diffusion by ADMM

SwiftEdit: Lightning Fast Text-Guided Image Editing via One-Step Diffusion

Revisiting Fairness in Multitask Learning: A Performance-Driven Approach for Variance Reduction

Learning from Synchronization: Self-Supervised Uncalibrated Multi-View Person Association in Challenging Scenes

RC-AutoCalib: An End-to-End Radar-Camera Automatic Calibration Network

CLIP-driven Coarse-to-fine Semantic Guidance for Fine-grained Open-set Semi-supervised Learning

InsTaG: Learning Personalized 3D Talking Head from Few-Second Video

Benchmarking Large Vision-Language Models via Directed Scene Graph for Comprehensive Image Captioning

ART: Anonymous Region Transformer for Variable Multi-Layer Transparent Image Generation

Rotation-Equivariant Self-Supervised Method in Image Denoising

GLane3D: Detecting Lanes with Graph of 3D Keypoints

Minimal Interaction Separated Tuning: A New Paradigm for Visual Adaptation

Hardware-Rasterized Ray-Based Gaussian Splatting

FreqDebias: Towards Generalizable Deepfake Detection via Consistency-Driven Frequency Debiasing

4DTAM: Non-Rigid Tracking and Mapping via Dynamic Surface Gaussians

Unseen Visual Anomaly Generation

ReNeg: Learning Negative Embedding with Reward Guidance

Perceptually Accurate 3D Talking Head Generation: New Definitions, Speech-Mesh Representation, and Evaluation Metrics

Decoder Gradient Shield: Provable and High-Fidelity Prevention of Gradient-Based Box-Free Watermark Removal

MotionPro: A Precise Motion Controller for Image-to-Video Generation

Goku: Flow Based Video Generative Foundation Models

Learning Conditional Space-Time Prompt Distributions for Video Class-Incremental Learning

Convex Combination Star Shape Prior for Data-driven Image Semantic Segmentation

Hyperbolic Safety-Aware Vision-Language Models

WISH: Weakly Supervised Instance Segmentation using Heterogeneous Labels

Relative Pose Estimation through Affine Corrections of Monocular Depth Priors

Occlusion-aware Text-Image-Point Cloud Pretraining for Open-World 3D Object Recognition

V2X-R: Cooperative LiDAR-4D Radar Fusion with Denoising Diffusion for 3D Object Detection

Foundations of the Theory of Performance-Based Ranking

Generating Multimodal Driving Scenes via Next-Scene Prediction

BIGS: Bimanual Category-agnostic Interaction Reconstruction from Monocular Videos via 3D Gaussian Splatting

APT: Adaptive Personalized Training for Diffusion Models with Limited Data

Symmetry Strikes Back: From Single-Image Symmetry Detection to 3D Generation

Frequency-Biased Synergistic Design for Image Compression and Compensation

PosterMaker: Towards High-Quality Product Poster Generation with Accurate Text Rendering

Sparse Voxels Rasterization: Real-time High-fidelity Radiance Field Rendering

Rethinking Personalized Aesthetics Assessment: Employing Physique Aesthetics Assessment as An Exemplification

You See it, You Got it: Learning 3D Creation on Pose-Free Videos at Scale

Locality-Aware Zero-Shot Human-Object Interaction Detection

PEACE: Empowering Geologic Map Holistic Understanding with MLLMs

SGFormer: Satellite-Ground Fusion for 3D Semantic Scene Completion

Random Conditioning with Distillation for Data-Efficient Diffusion Model Compression

MARVEL-40M+: Multi-Level Visual Elaboration for High-Fidelity Text-to-3D Content Creation

Hierarchical Gaussian Mixture Model Splatting for Efficient and Part Controllable 3D Generation

ERUPT: Efficient Rendering with Unposed Patch Transformer

Rethinking End-to-End 2D to 3D Scene Segmentation in Gaussian Splatting

Quad-Pixel Image Defocus Deblurring: A New Benchmark and Model

DocVLM: Make Your VLM an Efficient Reader

Revisiting Source-Free Domain Adaptation: Insights into Representativeness, Generalization, and Variety

FLARE: Feed-forward Geometry, Appearance and Camera Estimation from Uncalibrated Sparse Views

Improving Gaussian Splatting with Localized Points Management

GEAL: Generalizable 3D Affordance Learning with Cross-Modal Consistency

ESC: Erasing Space Concept for Knowledge Deletion

Language Guided Concept Bottleneck Models for Interpretable Continual Learning

Domain Adaptive Diabetic Retinopathy Grading with Model Absence and Flowing Data

Temporal Separation with Entropy Regularization for Knowledge Distillation in Spiking Neural Networks

LoRASculpt: Sculpting LoRA for Harmonizing General and Specialized Knowledge in Multimodal Large Language Models

Re-HOLD: Video Hand Object Interaction Reenactment via Adaptive Layout-instructed Diffusion Model

Theoretical Insights in Model Inversion Robustness and Conditional Entropy Maximization for Collaborative Inference Systems

SCFlow2: Plug-and-Play Object Pose Refiner with Shape-Constraint Scene Flow

D^3CTTA: Domain-Dependent Decorrelation for Continual Test-Time Adaption of 3D LiDAR Segmentation

Flowing from Words to Pixels: A Noise-Free Framework for Cross-Modality Evolution

FlipSketch: Flipping Static Drawings to Text-Guided Sketch Animations

Interpretable Generative Models through Post-hoc Concept Bottlenecks

SketchAgent: Language-Driven Sequential Sketch Generation

DRAWER: Digital Reconstruction and Articulation With Environment Realism

GoLF-NRT: Integrating Global Context and Local Geometry for Few-Shot View Synthesis

Deep Change Monitoring: A Hyperbolic Representative Learning Framework and a Dataset for Long-term Fine-grained Tree Change Detection

A Closer Look at Time Steps is Worthy of Triple Speed-Up for Diffusion Model Training

Empowering LLMs to Understand and Generate Complex Vector Graphics

PanoGS: Gaussian-based Panoptic Segmentation for 3D Open Vocabulary Scene Understanding

Watermarking One for All: A Robust Watermarking Scheme Against Partial Image Theft

ITA-MDT: Image-Timestep-Adaptive Masked Diffusion Transformer Framework for Image-Based Virtual Try-On

VolFormer: Explore More Comprehensive Cube Interaction for Hyperspectral Image Restoration and Beyond

Recovering Dynamic 3D Sketches from Videos

IM-Zero: Instance-level Motion Controllable Video Generation in a Zero-shot Manner

EigenGS Representation: From Eigenspace to Gaussian Image Space

SmartCLIP: Modular Vision-language Alignment with Identification Guarantees

UniMamba: Unified Spatial-Channel Representation Learning with Group-Efficient Mamba for LiDAR-based 3D Object Detection

MaSS13K: A Matting-level Semantic Segmentation Benchmark

EntropyMark: Towards More Harmless Backdoor Watermark via Entropy-based Constraint for Open-source Dataset Copyright Protection

Rethinking the Adversarial Robustness of Multi-Exit Neural Networks in an Attack-Defense Game

Q-DiT: Accurate Post-Training Quantization for Diffusion Transformers

ROD-MLLM: Towards More Reliable Object Detection in Multimodal Large Language Models

RoboGround: Robotic Manipulation with Grounded Vision-Language Priors

VideoGuide: Improving Video Diffusion Models without Training Through a Teacher's Guide

Improving Transferable Targeted Attacks with Feature Tuning Mixup

OmniStereo: Real-time Omnidirectional Depth Estimation with Multiview Fisheye Cameras

DroneSplat: 3D Gaussian Splatting for Robust 3D Reconstruction from In-the-Wild Drone Imagery

DrivingSphere: Building a High-fidelity 4D World for Closed-loop Simulation

Efficient Video Face Enhancement with Enhanced Spatial-Temporal Consistency

VELOCITI: Benchmarking Video-Language Compositional Reasoning with Strict Entailment

IDProtector: An Adversarial Noise Encoder to Protect Against ID-Preserving Image Generation

HuPerFlow: A Comprehensive Benchmark for Human vs. Machine Motion Estimation Comparison

Lessons and Insights from a Unifying Study of Parameter-Efficient Fine-Tuning (PEFT) in Visual Recognition

2DMamba: Efficient State Space Model for Image Representation with Applications on Giga-Pixel Whole Slide Image Classification

H2ST: Hierarchical Two-Sample Tests for Continual Out-of-Distribution Detection

MetaWriter: Personalized Handwritten Text Recognition Using Meta-Learned Prompt Tuning

Subnet-Aware Dynamic Supernet Training for Neural Architecture Search

CPath-Omni: A Unified Multimodal Foundation Model for Patch and Whole Slide Image Analysis in Computational Pathology

MoVE-KD: Knowledge Distillation for VLMs with Mixture of Visual Encoders

CamFreeDiff: Camera-free Image to Panorama Generation with Diffusion Model

Improving Visual and Downstream Performance of Low-Light Enhancer with Vision Foundation Models Collaboration

EchoWorld: Learning Motion-Aware World Models for Echocardiography Probe Guidance

FineLIP: Extending CLIP’s Reach via Fine-Grained Alignment with Longer Text Inputs

Illumination Spectrum Estimation for Multispectral Images via Surface Reflectance Modeling and Spatial-Spectral Feature Generation

UHD-processer: Unified UHD Image Restoration with Progressive Frequency Learning and Degradation-aware Prompts

Towards Zero-Shot Anomaly Detection and Reasoning with Multimodal Large Language Models

Neural Hierarchical Decomposition for Single Image Plant Modeling

Video-Guided Foley Sound Generation with Multimodal Controls

SACB-Net: Spatial-awareness Convolutions for Medical Image Registration

DCEvo: Discriminative Cross-Dimensional Evolutionary Learning for Infrared and Visible Image Fusion

TSD-SR: One-Step Diffusion with Target Score Distillation for Real-World Image Super-Resolution

AIpparel: A Multimodal Foundation Model for Digital Garments

Fast3R: Towards 3D Reconstruction of 1000+ Images in One Forward Pass

CTRL-O: Language-Controllable Object-Centric Visual Representation Learning

PO3AD: Predicting Point Offsets toward Better 3D Point Cloud Anomaly Detection

3D Occupancy Prediction with Low-Resolution Queries via Prototype-aware View Transformation

Can Large Vision-Language Models Correct Semantic Grounding Errors By Themselves?

SET: Spectral Enhancement for Tiny Object Detection

g3D-LF: Generalizable 3D-Language Feature Fields for Embodied Tasks

Arc2Avatar: Generating Expressive 3D Avatars from a Single Image via ID Guidance

Temporal Action Detection Model Compression by Progressive Block Drop

Face Forgery Video Detection via Temporal Forgery Cue Unraveling

Temporally Consistent Object-Centric Learning by Contrasting Slots

UniReal: Universal Image Generation and Editing via Learning Real-world Dynamics

Pursuing Temporal-Consistent Video Virtual Try-On via Dynamic Pose Interaction

Exploring Contextual Attribute Density in Referring Expression Counting

DINOv2 Meets Text: A Unified Framework for Image- and Pixel-Level Vision-Language Alignment

Learning Affine Correspondences by Integrating Geometric Constraints

UCOD-DPL: Unsupervised Camouflaged Object Detection via Dynamic Pseudo-label Learning

Geometry in Style: 3D Stylization via Surface Normal Deformation

Multi-modal Vision Pre-training for Medical Image Analysis

SegMAN: Omni-scale Context Modeling with State Space Models and Local Attention for Semantic Segmentation

OmniFlow: Any-to-Any Generation with Multi-Modal Rectified Flows

LIM: Large Interpolator Model for Dynamic Reconstruction

Multiple Object Tracking as ID Prediction

SLAM3R: Real-Time Dense Scene Reconstruction from Monocular RGB Videos

PIDLoc: Cross-View Pose Optimization Network Inspired by PID Controllers

VisionArena: 230k Real World User-VLM Conversations with Preference Labels

FAM Diffusion: Frequency and Attention Modulation for High-Resolution Image Generation with Stable Diffusion

Hash3D: Training-free Acceleration for 3D Generation

SemGeoMo: Dynamic Contextual Human Motion Generation with Semantic and Geometric Guidance

Generative Photomontage

Multi-view Reconstruction via SfM-guided Monocular Depth Estimation

Learning Hazing to Dehazing: Towards Realistic Haze Generation for Real-World Image Dehazing

HuMoCon: Concept Discovery for Human Motion Understanding

Curriculum Direct Preference Optimization for Diffusion and Consistency Models

OpenMIBOOD: Open Medical Imaging Benchmarks for Out-Of-Distribution Detection

Detecting Open World Objects via Partial Attribute Assignment

FactCheXcker: Mitigating Measurement Hallucinations in Chest X-ray Report Generation Models

Neural Inverse Rendering from Propagating Light

Personalized Preference Fine-tuning of Diffusion Models

UniPose: A Unified Multimodal Framework for Human Pose Comprehension, Generation and Editing

POMP: Physics-constrainable Motion Generative Model through Phase Manifolds

NN-Former: Rethinking Graph Structure in Neural Architecture Representation

ReSpec: Relevance and Specificity Grounded Online Filtering for Learning on Video-Text Data Streams

Reloc3r: Large-Scale Training of Relative Camera Pose Regression for Generalizable, Fast, and Accurate Visual Localization

A Unified Image-Dense Annotation Generation Model for Underwater Scenes

PACT: Pruning and Clustering-Based Token Reduction for Faster Visual Language Models

R-SCoRe: Revisiting Scene Coordinate Regression for Robust Large-Scale Visual Localization

DynRefer: Delving into Region-level Multimodal Tasks via Dynamic Resolution

OmniSplat: Taming Feed-Forward 3D Gaussian Splatting for Omnidirectional Images with Editable Capabilities

VideoWorld: Exploring Knowledge Learning from Unlabeled Videos

FSHNet: Fully Sparse Hybrid Network for 3D Object Detection

3D-SLNR: A Super Lightweight Neural Representation for Large-scale 3D Mapping

STINR: Deciphering Spatial Transcriptomics via Implicit Neural Representation

Remote Photoplethysmography in Real-World and Extreme Lighting Scenarios

Multi-Modal Contrastive Masked Autoencoders: A Two-Stage Progressive Pre-training Approach for RGBD Datasets

HaWoR: World-Space Hand Motion Reconstruction from Egocentric Videos

Font-Agent: Enhancing Font Understanding with Large Language Models

Secret Lies in Color: Enhancing AI-Generated Images Detection with Color Distribution Analysis

RADIOv2.5: Improved Baselines for Agglomerative Vision Foundation Models

High Temporal Consistency through Semantic Similarity Propagation in Semi-Supervised Video Semantic Segmentation for Autonomous Flight

Cross-Modal and Uncertainty-Aware Agglomeration for Open-Vocabulary 3D Scene Understanding

SVLTA: Benchmarking Vision-Language Temporal Alignment via Synthetic Video Situation

Mixture of Submodules for Domain Adaptive Person Search

Unsupervised Discovery of Facial Landmarks and Head Pose

Instruct-CLIP: Improving Instruction-Guided Image Editing with Automated Data Refinement Using Contrastive Learning

Stabilizing and Accelerating Autofocus with Expert Trajectory Regularized Deep Reinforcement Learning

SharpDepth: Sharpening Metric Depth Predictions Using Diffusion Distillation

NTClick: Achieving Precise Interactive Segmentation With Noise-tolerant Clicks

Repurposing Stable Diffusion Attention for Training-Free Unsupervised Interactive Segmentation

GO-N3RDet: Geometry Optimized NeRF-enhanced 3D Object Detector

EvEnhancer: Empowering Effectiveness, Efficiency and Generalizability for Continuous Space-Time Video Super-Resolution with Events

Simulator HC: Regression-based Online Simulation of Starting Problem-Solution Pairs for Homotopy Continuation in Geometric Vision

Dynamic Integration of Task-Specific Adapters for Class Incremental Learning

MoFlow: One-Step Flow Matching for Human Trajectory Forecasting via Implicit Maximum Likelihood Estimation based Distillation

DiverseFlow: Sample-Efficient Diverse Mode Coverage in Flows

Reason-before-Retrieve: One-Stage Reflective Chain-of-Thoughts for Training-Free Zero-Shot Composed Image Retrieval

MaRI: Material Retrieval Integration across Domains

GeoMM: On Geodesic Perspective for Multi-modal Learning

VISCO: Benchmarking Fine-Grained Critique and Correction Towards Self-Improvement in Visual Reasoning

Enhanced OoD Detection through Cross-Modal Alignment of Multi-Modal Representations

Breaking the Memory Barrier of Contrastive Loss via Tile-Based Strategy

RoboPEPP: Vision-Based Robot Pose and Joint Angle Estimation through Embedding Predictive Pre-Training

Universal Domain Adaptation for Semantic Segmentation

Distraction is All You Need for Multimodal Large Language Model Jailbreaking

DivPrune: Diversity-based Visual Token Pruning for Large Multimodal Models

Apollo: An Exploration of Video Understanding in Large Multimodal Models

Skip Tuning: Pre-trained Vision-Language Models are Effective and Efficient Adapters Themselves

Flexible Frame Selection for Efficient Video Reasoning

PatchDPO: Patch-level DPO for Finetuning-free Personalized Image Generation

Learning to Normalize on the SPD Manifold under Bures-Wasserstein Geometry

SAMWISE: Infusing Wisdom in SAM2 for Text-Driven Video Segmentation

GeoAvatar: Geometrically-Consistent Multi-Person Avatar Reconstruction from Sparse Multi-View Videos

Robust-MVTON: Learning Cross-Pose Feature Alignment and Fusion for Robust Multi-View Virtual Try-On

FreeGave: 3D Physics Learning from Dynamic Videos by Gaussian Velocity

Zero-Shot Monocular Scene Flow Estimation in the Wild

MG-MotionLLM: A Unified Framework for Motion Comprehension and Generation across Multiple Granularities

Retaining Knowledge and Enhancing Long-Text Representations in CLIP through Dual-Teacher Distillation

Optical-Flow Guided Prompt Optimization for Coherent Video Generation

MOS: Modeling Object-Scene Associations in Generalized Category Discovery

Anchor-Aware Similarity Cohesion in Target Frames Enables Predicting Temporal Moment Boundaries in 2D

Test-time Augmentation Improves Efficiency in Conformal Prediction

Breaking the Low-Rank Dilemma of Linear Attention

Code-as-Monitor: Constraint-aware Visual Programming for Reactive and Proactive Robotic Failure Detection

Embracing Collaboration Over Competition: Condensing Multiple Prompts for Visual In-Context Learning

Rethinking Reconstruction and Denoising in the Dark: New Perspective, General Architecture and Beyond

Edge-SD-SR: Low Latency and Parameter Efficient On-device Super-Resolution with Stable Diffusion via Bidirectional Conditioning

Unity in Diversity: Video Editing via Gradient-Latent Purification

Revealing Key Details to See Differences: A Novel Prototypical Perspective for Skeleton-based Action Recognition

VideoEspresso: A Large-Scale Chain-of-Thought Dataset for Fine-Grained Video Reasoning via Core Frame Selection

INFP: Audio-Driven Interactive Head Generation in Dyadic Conversations

Cross-Modal Distillation for 2D/3D Multi-Object Discovery from 2D Motion

Bridge Frame and Event: Common Spatiotemporal Fusion for High-Dynamic Scene Optical Flow

EVPGS: Enhanced View Prior Guidance for Splatting-based Extrapolated View Synthesis

GREAT: Geometry-Intention Collaborative Inference for Open-Vocabulary 3D Object Affordance Grounding

Inversion Circle Interpolation: Diffusion-based Image Augmentation for Data-scarce Classification

Deterministic Certification of Graph Neural Networks against Graph Poisoning Attacks with Arbitrary Perturbations

A3: Few-shot Prompt Learning of Unlearnable Examples with Cross-Modal Adversarial Feature Alignment

Adapting Pre-trained 3D Models for Point Cloud Video Understanding via Cross-frame Spatio-temporal Perception

MASH-VLM: Mitigating Action-Scene Hallucination in Video-LLMs through Disentangled Spatial-Temporal Representations

Unbiasing through Textual Descriptions: Mitigating Representation Bias in Video Benchmarks

Scene-Centric Unsupervised Panoptic Segmentation

UWAV: Uncertainty-weighted Weakly-supervised Audio-Visual Video Parsing

Mosaic of Modalities: A Comprehensive Benchmark for Multimodal Graph Learning

FirePlace: Geometric Refinements of LLM Common Sense Reasoning for 3D Object Placement

ASAP: Advancing Semantic Alignment Promotes Multi-Modal Manipulation Detecting and Grounding

MITracker: Multi-View Integration for Visual Object Tracking

Dual Diffusion for Unified Image Generation and Understanding

Layered Motion Fusion: Lifting Motion Segmentation to 3D in Egocentric Videos

FADE: Frequency-Aware Diffusion Model Factorization for Video Editing

MotiF: Making Text Count in Image Animation with Motion Focal Loss

Data Synthesis with Diverse Styles for Face Recognition via 3DMM-Guided Diffusion

Uncertainty-guided Perturbation for Image Super-Resolution Diffusion Model

Geometric Knowledge-Guided Localized Global Distribution Alignment for Federated Learning

StoryGPT-V: Large Language Models as Consistent Story Visualizers

InteractAnything: Zero-shot Human Object Interaction Synthesis via LLM Feedback and Object Affordance Parsing

Unveiling the Mist over 3D Vision-Language Understanding: Object-centric Evaluation with Chain-of-Analysis

Towards Human-Understandable Multi-Dimensional Concept Discovery

GlyphMastero: A Glyph Encoder for High-Fidelity Scene Text Editing

Perturb-and-Revise: Flexible 3D Editing with Generative Trajectories

Learning Compatible Multi-Prize Subnetworks for Asymmetric Retrieval

CLIP Under the Microscope: A Fine-Grained Analysis of Multi-Object Representation

MetricGrids: Arbitrary Nonlinear Approximation with Elementary Metric Grids based Implicit Neural Representation

FoundationStereo: Zero-Shot Stereo Matching

MovieBench: A Hierarchical Movie Level Dataset for Long Video Generation

Zero-Shot Styled Text Image Generation, but Make It Autoregressive

Dissecting and Mitigating Diffusion Bias via Mechanistic Interpretability

Learning Physics-Based Full-Body Human Reaching and Grasping from Brief Walking References

Commonsense Video Question Answering through Video-Grounded Entailment Tree Reasoning

GPVK-VL: Geometry-Preserving Virtual Keyframes for Visual Localization under Large Viewpoint Changes

Graph Neural Network Combining Event Stream and Periodic Aggregation for Low-Latency Event-based Vision

Be More Specific: Evaluating Object-centric Realism in Synthetic Images

3D Convex Splatting: Radiance Field Rendering with 3D Smooth Convexes

Correlative and Discriminative Label Grouping for Multi-Label Visual Prompt Tuning

LoRACLR: Contrastive Adaptation for Customization of Diffusion Models

ArtFormer: Controllable Generation of Diverse 3D Articulated Objects

Opportunistic Single-Photon Time of Flight

Finsler Multi-Dimensional Scaling: Manifold Learning for Asymmetric Dimensionality Reduction and Embedding

Bridging Gait Recognition and Large Language Models Sequence Modeling

Argus: Vision-Centric Reasoning with Grounded Chain-of-Thought

Bootstrap Your Own Views: Masked Ego-Exo Modeling for Fine-grained View-invariant Video Representations

SFDM: Robust Decomposition of Geometry and Reflectance for Realistic Face Rendering from Sparse-view Images

DiSRT-In-Bed: Diffusion-Based Sim-to-Real Transfer Framework for In-Bed Human Mesh Recovery

QMambaBSR: Burst Image Super-Resolution with Query State Space Model

Focal Split: Untethered Snapshot Depth from Differential Defocus

ArcPro: Architectural Programs for Structured 3D Abstraction of Sparse Points

Encapsulated Composition of Text-to-Image and Text-to-Video Models for High-Quality Video Synthesis

Multi-Group Proportional Representations for Text-to-Image Models

Towards Generalizable Trajectory Prediction using Dual-Level Representation Learning and Adaptive Prompting

Seeing A 3D World in A Grain of Sand

COUNTS: Benchmarking Object Detectors and Multimodal Large Language Models under Distribution Shifts

Retrieving Semantics from the Deep: an RAG Solution for Gesture Synthesis

Towards a Universal Synthetic Video Detector: From Face or Background Manipulations to Fully AI-Generated Content

TokenFlow: Unified Image Tokenizer for Multimodal Understanding and Generation

Improving Personalized Search with Regularized Low-Rank Parameter Updates

A Focused Human Body Model for Accurate Anthropometric Measurements Extraction

SnapGen-V: Generating a Five-Second Video within Five Seconds on a Mobile Device

Adapting Dense Matching for Homography Estimation with Grid-based Acceleration

HyperLoRA: Parameter-Efficient Adaptive Generation for Portrait Synthesis

ACE: Anti-Editing Concept Erasure in Text-to-Image Models

EchoMatch: Partial-to-Partial Shape Matching via Correspondence Reflection

CoSDH: Communication-Efficient Collaborative Perception via Supply-Demand Awareness and Intermediate-Late Hybridization

Order-Robust Class Incremental Learning: Graph-Driven Dynamic Similarity Grouping

Hierarchical Knowledge Prompt Tuning for Multi-task Test-Time Adaptation

A Regularization-Guided Equivariant Approach for Image Restoration

LaTexBlend: Scaling Multi-concept Customized Generation with Latent Textual Blending

DejaVid: Encoder-Agnostic Learned Temporal Matching for Video Classification

DPSeg: Dual-Prompt Cost Volume Learning for Open-Vocabulary Semantic Segmentation

One Diffusion to Generate Them All

Bias for Action: Video Implicit Neural Representations with Bias Modulation

Let's Verify and Reinforce Image Generation Step by Step

All-Optical Nonlinear Diffractive Deep Network for Ultrafast Image Denoising

CoSER: Towards Consistent Dense Multiview Text-to-Image Generator for 3D Creation

HybridMQA: Exploring Geometry-Texture Interactions for Colored Mesh Quality Assessment

Self-Learning Hyperspectral and Multispectral Image Fusion via Adaptive Residual Guided Subspace Diffusion Model

SIR-DIFF: Sparse Image Sets Restoration with Multi-View Diffusion Model

StickMotion: Generating 3D Human Motions by Drawing a Stickman

Reversible Decoupling Network for Single Image Reflection Removal

Enduring, Efficient and Robust Trajectory Prediction Attack in Autonomous Driving via Optimization-Driven Multi-Frame Perturbation Framework

GLASS: Guided Latent Slot Diffusion for Object-Centric Learning

Multimodal Autoregressive Pre-training of Large Vision Encoders

UNEM: UNrolled Generalized EM for Transductive Few-Shot Learning

SASep: Saliency-Aware Structured Separation of Geometry and Feature for Open Set Learning on Point Clouds

Low-Biased General Annotated Dataset Generation

Deterministic Image-to-Image Translation via Denoising Brownian Bridge Models with Dual Approximators

ADD: Attribution-Driven Data Augmentation Framework for Boosting Image Super-Resolution

Generative Hard Example Augmentation for Semantic Point Cloud Segmentation

Toward Generalized Image Quality Assessment: Relaxing the Perfect Reference Quality Assumption

Explaining Domain Shifts in Language: Concept Erasing for Interpretable Image Classification

Hazy Low-Quality Satellite Video Restoration Via Learning Optimal Joint Degradation Patterns and Continuous-Scale Super-Resolution Reconstruction

Textured Gaussians for Enhanced 3D Scene Appearance Modeling

Beyond Sight: Towards Cognitive Alignment in LVLM via Enriched Visual Knowledge

Global-Local Tree Search in VLMs for 3D Indoor Scene Generation

Volumetric Surfaces: Representing Fuzzy Geometries with Layered Meshes

GFlowVLM: Enhancing Multi-step Reasoning in Vision-Language Models with Generative Flow Networks

RoboSpatial: Teaching Spatial Understanding to 2D and 3D Vision-Language Models for Robotics

Segment Any-Quality Images with Generative Latent Space Enhancement

BG-Triangle: Bézier Gaussian Triangle for 3D Vectorization and Rendering

Multi-subject Open-set Personalization in Video Generation

MIMO: Controllable Character Video Synthesis with Spatial Decomposed Modeling

TKG-DM: Training-free Chroma Key Content Generation Diffusion Model

Lift3D Policy: Lifting 2D Foundation Models for Robust 3D Robotic Manipulation

On the Generalization of Handwritten Text Recognition Models

From Sparse to Dense: Camera Relocalization with Scene-Specific Detector from Feature Gaussian Splatting

CityWalker: Learning Embodied Urban Navigation from Web-Scale Videos

A Simple yet Effective Layout Token in Large Language Models for Document Understanding

Reconstruction vs. Generation: Taming Optimization Dilemma in Latent Diffusion Models

OODD: Test-time Out-of-Distribution Detection with Dynamic Dictionary

BIMBA: Selective-Scan Compression for Long-Range Video Question Answering

Diff2Flow: Training Flow Matching Models via Diffusion Model Alignment

JanusFlow: Harmonizing Autoregression and Rectified Flow for Unified Multimodal Understanding and Generation

Visual Prompting for One-shot Controllable Video Editing without Inversion

MegaSynth: Scaling Up 3D Scene Reconstruction with Synthesized Data

Prof. Robot: Differentiable Robot Rendering Without Static and Self-Collisions

Flash-Split: 2D Reflection Removal with Flash Cues and Latent Diffusion Separation

Exploring Timeline Control for Facial Motion Generation

Attention IoU: Examining Biases in CelebA using Attention Maps

Segment Any Motion in Videos

HandOS: 3D Hand Reconstruction in One Stage

Task-aware Cross-modal Feature Refinement Transformer with Large Language Models for Visual Grounding

DropGaussian: Structural Regularization for Sparse-view Gaussian Splatting

All-Day Multi-Camera Multi-Target Tracking

Blurred LiDAR for Sharper 3D: Robust Handheld 3D Scanning with Diffuse LiDAR and RGB

TAROT: Towards Essentially Domain-Invariant Robustness with Theoretical Justification

PatchDEMUX: A Certifiably Robust Framework for Multi-label Classifiers Against Adversarial Patches

StarVector: Generating Scalable Vector Graphics Code from Images and Text

Novel View Synthesis with Pixel-Space Diffusion Models

GASP: Gaussian Avatars with Synthetic Priors

Object Detection using Event Camera: A MoE Heat Conduction based Detector and A New Benchmark Dataset

STCOcc: Sparse Spatial-Temporal Cascade Renovation for 3D Occupancy and Scene Flow Prediction

Document Haystacks: Vision-Language Reasoning Over Piles of 1000+ Documents

Rethinking Few-Shot Adaptation of Vision-Language Models in Two Stages

TAGA: Self-supervised Learning for Template-free Animatable Gaussian Articulated Model

MambaVO: Deep Visual Odometry Based on Sequential Matching Refinement and Training Smoothing

MLLM-as-a-Judge for Image Safety without Human Labeling

Enhancing Online Continual Learning with Plug-and-Play State Space Model and Class-Conditional Mixture of Discretization

Explaining in Diffusion: Explaining a Classifier with Diffusion Semantics

Horizon-GS: Unified 3D Gaussian Splatting for Large-Scale Aerial-to-Ground Scenes

Attention Distillation: A Unified Approach to Visual Characteristics Transfer

Instant Gaussian Stream: Fast and Generalizable Streaming of Dynamic Scene Reconstruction via Gaussian Splatting

LotusFilter: Fast Diverse Nearest Neighbor Search via a Learned Cutoff Table

DreamRelation: Bridging Customization and Relation Generation

IndoorGS: Geometric Cues Guided Gaussian Splatting for Indoor Scene Reconstruction

All Languages Matter: Evaluating LMMs on Culturally Diverse 100 Languages

Point-Cache: Test-time Dynamic and Hierarchical Cache for Robust and Generalizable Point Cloud Analysis

MoManipVLA: Transferring Vision-language-action Models for General Mobile Manipulation

Depth-Guided Bundle Sampling for Efficient Generalizable Neural Radiance Field Reconstruction

Text Augmented Correlation Transformer For Few-shot Classification & Segmentation

DeSiRe-GS: 4D Street Gaussians for Static-Dynamic Decomposition and Surface Reconstruction for Urban Driving Scenes

Ref-GS: Directional Factorization for 2D Gaussian Splatting

Link to the Past: Temporal Propagation for Fast 3D Human Reconstruction from Monocular Video

SVG-IR: Spatially-Varying Gaussian Splatting for Inverse Rendering

Beyond Single-Modal Boundary: Cross-Modal Anomaly Detection through Visual Prototype and Harmonization

VDocRAG: Retrieval-Augmented Generation over Visually-Rich Documents

Pose Priors from Language Models

Scaling Mesh Generation via Compressive Tokenization

Generative Densification: Learning to Densify Gaussians for High-Fidelity Generalizable 3D Reconstruction

HyperGS: Hyperspectral 3D Gaussian Splatting

Exploring Intrinsic Normal Prototypes within a Single Image for Universal Anomaly Detection

Augmenting Perceptual Super-Resolution via Image Quality Predictors

TurboFill: Adapting Few-step Text-to-image Model for Fast Image Inpainting

Stochastic Human Motion Prediction with Memory of Action Transition and Action Characteristic

Localized Concept Erasure for Text-to-Image Diffusion Models Using Training-Free Gated Low-Rank Adaptation

Perception Tokens Enhance Visual Reasoning in Multimodal Language Models

ReCapture: Generative Video Camera Controls for User-Provided Videos using Masked Video Fine-Tuning

Exploring Simple Open-Vocabulary Semantic Segmentation

Are Images Indistinguishable to Humans Also Indistinguishable to Classifiers?

X-Dyna: Expressive Dynamic Human Image Animation

Understanding Multi-layered Transmission Matrices

GS-DiT: Advancing Video Generation with Dynamic 3D Gaussian Fields through Efficient Dense 3D Point Tracking

Hierarchical Features Matter: A Deep Exploration of Progressive Parameterization Method for Dataset Distillation

AnyMoLe: Any Character Motion In-betweening Leveraging Video Diffusion Models

Towards Optimizing Large-Scale Multi-Graph Matching in Bioimaging

SinGS: Animatable Single-Image Human Gaussian Splats with Kinematic Priors

SALAD: Skeleton-aware Latent Diffusion for Text-driven Motion Generation and Editing

Seq2Time: Sequential Knowledge Transfer for Video LLM Temporal Grounding

Towards Effective and Sparse Adversarial Attack on Spiking Neural Networks via Breaking Invisible Surrogate Gradients

Towards Understanding and Quantifying Uncertainty for Text-to-Image Generation

CoCoGaussian: Leveraging Circle of Confusion for Gaussian Splatting from Defocused Images

PS-Diffusion: Photorealistic Subject-Driven Image Editing with Disentangled Control and Attention

Exploring Visual Vulnerabilities via Multi-Loss Adversarial Search for Jailbreaking Vision-Language Models

LLaVA-ST: A Multimodal Large Language Model for Fine-Grained Spatial-Temporal Understanding

ViKIENet: Towards Efficient 3D Object Detection with Virtual Key Instance Enhanced Network

Feature Selection for Latent Factor Models

MaIR: A Locality- and Continuity-Preserving Mamba for Image Restoration

Few-shot Implicit Function Generation via Equivariance

Towards Million-Scale Adversarial Robustness Evaluation With Stronger Individual Attacks

RSAR: Restricted State Angle Resolver and Rotated SAR Benchmark

Dual Energy-Based Model with Open-World Uncertainty Estimation for Out-of-distribution Detection

DTGBrepGen: A Novel B-rep Generative Model through Decoupling Topology and Geometry

Continuous Space-Time Video Resampling with Invertible Motion Steganography

Multi-Modal Synergistic Implicit Image Enhancement for Efficient Optical Flow Estimation

Test-Time Backdoor Detection for Object Detection Models

Learning Audio-guided Video Representation with Gated Attention for Video-Text Retrieval

ProtoDepth: Unsupervised Continual Depth Completion with Prototypes

Generative Multiview Relighting for 3D Reconstruction under Extreme Illumination Variation

TASTE-Rob: Advancing Video Generation of Task-Oriented Hand-Object Interaction for Generalizable Robotic Manipulation

NoT: Federated Unlearning via Weight Negation

ParaHome: Parameterizing Everyday Home Activities Towards 3D Generative Modeling of Human-Object Interactions

Adapting to the Unknown: Training-Free Audio-Visual Event Perception with Dynamic Thresholds

DecoupledGaussian: Object-Scene Decoupling for Physics-Based Interaction

OpenHumanVid: A Large-Scale High-Quality Dataset for Enhancing Human-Centric Video Generation

Classifier-Free Guidance Inside the Attraction Basin May Cause Memorization

Vid2Avatar-Pro: Authentic Avatar from Videos in the Wild via Universal Prior

RANGE: Retrieval Augmented Neural Fields for Multi-Resolution Geo-Embeddings

Magma: A Foundation Model for Multimodal AI Agents

SimMotionEdit: Text-Based Human Motion Editing with Motion Similarity Prediction

SerialGen: Personalized Image Generation by First Standardization Then Personalization

SimAvatar: Simulation-Ready Avatars with Layered Hair and Clothing

From Head to Tail: Efficient Black-box Model Inversion Attack via Long-tailed Learning

Matrix3D: Large Photogrammetry Model All-in-One

Object-Centric Prompt-Driven Vision-Language-Action Model for Robotic Manipulation

Proximal Algorithm Unrolling: Flexible and Efficient Reconstruction Networks for Single-Pixel Imaging

Depth Any Camera: Zero-Shot Metric Depth Estimation from Any Camera

ShotAdapter: Text-to-Multi-Shot Video Generation with Diffusion Models

Image Quality Assessment: From Human to Machine Preference

Context-Aware Multimodal Pretraining

Sound Bridge: Associating Egocentric and Exocentric Videos via Audio Cues

LayoutVLM: Differentiable Optimization of 3D Layout via Vision-Language Models

Point Clouds Meets Physics: Dynamic Acoustic Field Fitting Network for Point Cloud Understanding

ODA-GAN: Orthogonal Decoupling Alignment GAN Assisted by Weakly-supervised Learning for Virtual Immunohistochemistry Staining

Satellite to GroundScape - Large-scale Consistent Ground View Generation from Satellite Views

Faster Parameter-Efficient Tuning with Token Redundancy Reduction

A Unified Model for Compressed Sensing MRI Across Undersampling Patterns

HyperGLM: HyperGraph for Video Scene Graph Generation and Anticipation

BlockDance: Reuse Structurally Similar Spatio-Temporal Features to Accelerate Diffusion Transformers

Panorama Generation From NFoV Image Done Right

Mamba-Adaptor: State Space Model Adaptor for Visual Recognition

Robust Message Embedding via Attention Flow-Based Steganography

Task-driven Image Fusion with Learnable Fusion Loss

Compositional Targeted Multi-Label Universal Perturbations

PatchGuard: Adversarially Robust Anomaly Detection and Localization through Vision Transformers and Pseudo Anomalies

Distilling Monocular Foundation Model for Fine-grained Depth Completion

LamRA: Large Multimodal Model as Your Advanced Retrieval Assistant

Geometry-guided Online 3D Video Synthesis with Multi-View Temporal Consistency

FSFM: A Generalizable Face Security Foundation Model via Self-Supervised Facial Representation Learning

Zero-1-to-A: Zero-Shot One Image to Animatable Head Avatars Using Video Diffusion

On-Device Self-Supervised Learning of Low-Latency Monocular Depth from Only Events

LogoSP: Local-global Grouping of Superpoints for Unsupervised Semantic Segmentation of 3D Point Clouds

One is Plenty: A Polymorphic Feature Interpreter for Immutable Heterogeneous Collaborative Perception

Dynamic Updates for Language Adaptation in Visual-Language Tracking

CustAny: Customizing Anything from A Single Example

CoMM: A Coherent Interleaved Image-Text Dataset for Multimodal Understanding and Generation

MoGe: Unlocking Accurate Monocular Geometry Estimation for Open-Domain Images with Optimal Training Supervision

ScaleLSD: Scalable Deep Line Segment Detection Streamlined

Revisiting MAE Pre-training for 3D Medical Image Segmentation

Learning with Noisy Triplet Correspondence for Composed Image Retrieval

FlashSloth: Lightning Multimodal Large Language Models via Embedded Visual Compression

DiET-GS: Diffusion Prior and Event Stream-Assisted Motion Deblurring 3D Gaussian Splatting

Birth and Death of a Rose

DualPM: Dual Posed-Canonical Point Maps for 3D Shape and Pose Reconstruction

The Scene Language: Representing Scenes with Programs, Words, and Embeddings

CGMatch: A Different Perspective of Semi-supervised Learning

ChatHuman: Chatting about 3D Humans with Tools

Leveraging Temporal Cues for Semi-Supervised Multi-View 3D Object Detection

Reconciling Stochastic and Deterministic Strategies for Zero-shot Image Restoration using Diffusion Model in Dual

Boosting Point-Supervised Temporal Action Localization through Integrating Query Reformation and Optimal Transport

Everything to the Synthetic: Diffusion-driven Test-time Adaptation via Synthetic-Domain Alignment

Hierarchical Flow Diffusion for Efficient Frame Interpolation

Recurrence-Enhanced Vision-and-Language Transformers for Robust Multimodal Document Retrieval

Camouflage Anything: Learning to Hide using Controlled Out-painting and Representation Engineering

Test-Time Fine-Tuning of Image Compression Models for Multi-Task Adaptability

BASKET: A Large-Scale Video Dataset for Fine-Grained Skill Estimation

AniDoc: Animation Creation Made Easier

DynPose: Largely Improving the Efficiency of Human Pose Estimation by a Simple Dynamic Framework

Arbitrary-steps Image Super-resolution via Diffusion Inversion

Dynamic Neural Surfaces for Elastic 4D Shape Representation and Analysis

Spk2SRImgNet: Super-Resolve Dynamic Scene from Spike Stream via Motion Aligned Collaborative Filtering

ComfyBench: Benchmarking LLM-based Agents in ComfyUI for Autonomously Designing Collaborative AI Systems

VideoGLaMM: A Large Multimodal Model for Pixel-Level Visual Grounding in Videos

TaoAvatar: Real-Time Lifelike Full-Body Talking Avatars for Augmented Reality via 3D Gaussian Splatting

Uni4D: Unifying Visual Foundation Models for 4D Modeling from a Single Video

DiffLocks: Generating 3D Hair from a Single Image using Diffusion Models

Hyperbolic Category Discovery

CALICO: Part-Focused Semantic Co-Segmentation with Large Vision-Language Models

Cross-modal Causal Relation Alignment for Video Question Grounding

Diffusion Renderer: Neural Inverse and Forward Rendering with Video Diffusion Models

Harnessing Frequency Spectrum Insights for Image Copyright Protection Against Diffusion Models

Learning to Detect Objects from Multi-Agent LiDAR Scans without Manual Labels

DeepLA-Net: Very Deep Local Aggregation Networks for Point Cloud Analysis

AdaptCMVC: Robust Adaption to Incremental Views in Continual Multi-view Clustering

UA-Pose: Uncertainty-Aware 6D Object Pose Estimation and Online Object Completion with Partial References

Missing Target-Relevant Information Prediction with World Model for Accurate Zero-Shot Composed Image Retrieval

Binarized Mamba-Transformer for Lightweight Quad Bayer HybridEVS Demosaicing

DiffSensei: Bridging Multi-Modal LLMs and Diffusion Models for Customized Manga Generation

PhD: A ChatGPT-Prompted Visual Hallucination Evaluation Dataset

ClimbingCap: Multi-Modal Dataset and Method for Rock Climbing in World Coordinate

Estimating Body and Hand Motion in an Ego-sensed World

A Bias-Free Training Paradigm for More General AI-generated Image Detection

Evaluating Vision-Language Models as Evaluators in Path Planning

Transformers without Normalization

SGC-Net: Stratified Granular Comparison Network for Open-Vocabulary HOI Detection

Certified Human Trajectory Prediction

Galaxy Walker: Geometry-aware VLMs For Galaxy-scale Understanding

DFM: Differentiable Feature Matching for Anomaly Detection

PointSR: Self-Regularized Point Supervision for Drone-View Object Detection

MVDoppler-Pose: Multi-Modal Multi-View mmWave Sensing for Long-Distance Self-Occluded Human Walking Pose Estimation

Gain from Neighbors: Boosting Model Robustness in the Wild via Adversarial Perturbations Toward Neighboring Classes

De^2Gaze: Deformable and Decoupled Representation Learning for 3D Gaze Estimation

M^3-VOS: Multi-Phase, Multi-Transition, and Multi-Scenery Video Object Segmentation

Self-Expansion of Pre-trained Models with Mixture of Adapters for Continual Learning

Brain-Inspired Spiking Neural Networks for Energy-Efficient Object Detection

Multi-Granularity Class Prototype Topology Distillation for Class-Incremental Source-Free Unsupervised Domain Adaptation

A Polarization-Aided Transformer for Image Deblurring via Motion Vector Decomposition

Language-Guided Image Tokenization for Generation

Img-Diff: Contrastive Data Synthesis for Multimodal Large Language Models

CocoER: Aligning Multi-Level Feature by Competition and Coordination for Emotion Recognition

Medusa: A Multi-Scale High-order Contrastive Dual-Diffusion Approach for Multi-View Clustering

Hyperbolic Uncertainty-Aware Few-Shot Incremental Point Cloud Segmentation

Enhancing Creative Generation on Stable Diffusion-based Models

Denoising Functional Maps: Diffusion Models for Shape Correspondence

GAF: Gaussian Avatar Reconstruction from Monocular Videos via Multi-view Diffusion

Free on the Fly: Enhancing Flexibility in Test-Time Adaptation with Online EM

DnLUT: Ultra-Efficient Color Image Denoising via Channel-Aware Lookup Tables

EVolSplat: Efficient Volume-based Gaussian Splatting for Urban View Synthesis

Incomplete Multi-View Multi-label Learning via Disentangled Representation and Label Semantic Embedding

EquiPose: Exploiting Permutation Equivariance for Relative Camera Pose Estimation

AutoURDF: Unsupervised Robot Modeling from Point Cloud Frames Using Cluster Registration

iSegMan: Interactive Segment-and-Manipulate 3D Gaussians

Building Vision Models upon Heat Conduction

ZeroGrasp: Zero-Shot Shape Reconstruction Enabled Robotic Grasping

PDFactor: Learning Tri-Perspective View Policy Diffusion Field for Multi-Task Robotic Manipulation

WiLoR: End-to-end 3D Hand Localization and Reconstruction in-the-wild

VideoTree: Adaptive Tree-based Video Representation for LLM Reasoning on Long Videos

Multi-modal Contrastive Learning with Negative Sampling Calibration for Phenotypic Drug Discovery

OralXrays-9: Towards Hospital-Scale Panoramic X-ray Anomaly Detection via Personalized Multi-Object Query-Aware Mining

R-TPT: Improving Adversarial Robustness of Vision-Language Models through Test-Time Prompt Tuning

SplatFlow: Multi-View Rectified Flow Model for 3D Gaussian Splatting Synthesis

Fancy123: One Image to High-Quality 3D Mesh Generation via Plug-and-Play Deformation

Dense Dispersed Structured Light for Hyperspectral 3D Imaging of Dynamic Scenes

Language-Guided Audio-Visual Learning for Long-Term Sports Assessment

PlanarSplatting: Accurate Planar Surface Reconstruction in 3 Minutes

GBC-Splat: Generalizable Gaussian-Based Clothed Human Digitalization under Sparse RGB Cameras

SoundVista: Novel-View Ambient Sound Synthesis via Visual-Acoustic Binding

Omni-ID: Holistic Identity Representation Designed for Generative Tasks

IM-Portrait: Learning 3D-aware Video Diffusion for Photorealistic Talking Heads from Monocular Videos

MM-OR: A Large Multimodal Operating Room Dataset for Semantic Understanding of High-Intensity Surgical Environments

MIRE: Matched Implicit Neural Representations

AeSPa: Attention-guided Self-supervised Parallel Imaging for MRI Reconstruction

Visual Consensus Prompting for Co-Salient Object Detection

MagicArticulate: Make Your 3D Models Articulation-Ready

Boltzmann Attention Sampling for Image Analysis with Small Objects

Generalized Recorrupted-to-Recorrupted: Self-Supervised Learning Beyond Gaussian Noise

Dynamic Motion Blending for Versatile Motion Editing

Open Set Label Shift with Test Time Out-of-Distribution Reference

Action Detail Matters: Refining Video Recognition with Local Action Queries

StdGEN: Semantic-Decomposed 3D Character Generation from Single Images

DiffusionDrive: Truncated Diffusion Model for End-to-End Autonomous Driving

MAC-Ego3D: Multi-Agent Gaussian Consensus for Real-Time Collaborative Ego-Motion and Photorealistic 3D Reconstruction

Maintaining Consistent Inter-Class Topology in Continual Test-Time Adaptation

UniVAD: A Training-free Unified Model for Few-shot Visual Anomaly Detection

FFR: Frequency Feature Rectification for Weakly Supervised Semantic Segmentation

Video-XL: Extra-Long Vision Language Model for Hour-Scale Video Understanding

Reconstructing In-the-Wild Open-Vocabulary Human-Object Interactions

Sonata: Self-Supervised Learning of Reliable Point Representations

COSMIC: Clique-Oriented Semantic Multi-space Integration for Robust CLIP Test-Time Adaptation

Training-free Dense-Aligned Diffusion Guidance for Modular Conditional Image Synthesis

DriveGEN: Generalized and Robust 3D Detection in Driving via Controllable Text-to-Image Diffusion Generation

Playing the Fool: Jailbreaking LLMs and Multimodal LLMs with Out-of-Distribution Strategy

h-Edit: Effective and Flexible Diffusion-Based Editing via Doob's h-Transform

VideoSPatS: Video SPatiotemporal Splines for Disentangled Occlusion, Appearance and Motion Modeling and Editing

DRiVE: Diffusion-based Rigging Empowers Generation of Versatile and Expressive Characters

Online Video Understanding: OVBench and VideoChat-Online

T2ICount: Enhancing Cross-modal Understanding for Zero-Shot Counting

TADFormer: Task-Adaptive Dynamic TransFormer for Efficient Multi-Task Learning

A Unified Approach to Interpreting Self-supervised Pre-training Methods for 3D Point Clouds via Interactions

Enhancing SAM with Efficient Prompting and Preference Optimization for Semi-supervised Medical Image Segmentation

CamPoint: Boosting Point Cloud Segmentation with Virtual Camera

Towards Generalizable Scene Change Detection

LibraGrad: Balancing Gradient Flow for Universally Better Vision Transformer Attributions

LightLoc: Learning Outdoor LiDAR Localization at Light Speed

MERGE: Multi-faceted Hierarchical Graph-based GNN for Gene Expression Prediction from Whole Slide Histopathology Images

D^3-Human: Dynamic Disentangled Digital Human from Monocular Video

Open Ad-hoc Categorization with Contextualized Feature Learning

Accurate Differential Operators for Hybrid Neural Fields

Seeing What Matters: Empowering CLIP with Patch Generation-to-Selection

REWIND: Real-Time Egocentric Whole-Body Motion Diffusion with Exemplar-Based Identity Conditioning

FeedEdit: Text-Based Image Editing with Dynamic Feedback Regulation

Classifier-guided CLIP Distillation for Unsupervised Multi-label Classification

UMotion: Uncertainty-driven Human Motion Estimation from Inertial and Ultra-wideband Units

STEREO: A Two-Stage Framework for Adversarially Robust Concept Erasing from Text-to-Image Diffusion Models

GenVDM: Generating Vector Displacement Maps From a Single Image

DropoutGS: Dropping Out Gaussians for Better Sparse-view Rendering

Effective SAM Combination for Open-Vocabulary Semantic Segmentation

Masked Scene Modeling: Narrowing the Gap Between Supervised and Self-Supervised Learning in 3D Scene Understanding

SPA-VL: A Comprehensive Safety Preference Alignment Dataset for Vision Language Models

UniAP: Unifying Inter- and Intra-Layer Automatic Parallelism by Mixed Integer Quadratic Programming

PTDiffusion: Free Lunch for Generating Optical Illusion Hidden Pictures with Phase-Transferred Diffusion Model

ScribbleLight: Single Image Indoor Relighting with Scribbles

Attend to Not Attended: Structure-then-Detail Token Merging for Post-training DiT Acceleration

Quantization without Tears

Turbo3D: Ultra-fast Text-to-3D Generation

SUM Parts: Benchmarking Part-Level Semantic Segmentation of Urban Meshes

Higher-Order Ratio Cycles for Fast and Globally Optimal Shape Matching

Hyperdimensional Uncertainty Quantification for Multimodal Uncertainty Fusion in Autonomous Vehicles Perception

EDM: Equirectangular Projection-Oriented Dense Kernelized Feature Matching

K-Sort Arena: Efficient and Reliable Benchmarking for Generative Models via K-wise Human Preferences

Simpler Diffusion: 1.5 FID on ImageNet512 with Pixel-space Diffusion

Few-shot Personalized Scanpath Prediction

EZSR: Event-based Zero-Shot Recognition

FlowRAM: Grounding Flow Matching Policy with Region-Aware Mamba Framework for Robotic Manipulation

SVFR: A Unified Framework for Generalized Video Face Restoration

Dinomaly: The Less Is More Philosophy in Multi-Class Unsupervised Anomaly Detection

Decoupling Fine Detail and Global Geometry for Compressed Depth Map Super-Resolution

Test-Time Visual In-Context Tuning

Prior Does Matter: Visual Navigation via Denoising Diffusion Bridge Models

SegEarth-OV: Towards Training-Free Open-Vocabulary Segmentation for Remote Sensing Images

MeshGen: Generating PBR Textured Mesh with Render-Enhanced Auto-Encoder and Generative Data Augmentation

GIFStream: 4D Gaussian-based Immersive Video with Feature Stream

Multi-Scale Neighborhood Occupancy Masked Autoencoder for Self-Supervised Learning in LiDAR Point Clouds

CAP-Net: A Unified Network for 6D Pose and Size Estimation of Categorical Articulated Parts from a Single RGB-D Image

Do We Really Need Curated Malicious Data for Safety Alignment in Multi-modal Large Language Models?

Plug-and-Play PPO: An Adaptive Point Prompt Optimizer Making SAM Greater

Harnessing Global-Local Collaborative Adversarial Perturbation for Anti-Customization

Pippo: High-Resolution Multi-View Humans from a Single Image

EchoONE: Segmenting Multiple Echocardiography Planes in One Model

Acc3D: Accelerating Single Image to 3D Diffusion Models via Edge Consistency Guided Score Distillation

EasyHOI: Unleashing the Power of Large Models for Reconstructing Hand-Object Interactions in the Wild

MVSAnywhere: Zero-Shot Multi-View Stereo

Patch Matters: Training-free Fine-grained Image Caption Enhancement via Local Perception

Attribute-Missing Multi-view Graph Clustering

Generating 6DoF Object Manipulation Trajectories from Action Description in Egocentric Vision

FruitNinja: 3D Object Interior Texture Generation with Gaussian Splatting

EnergyMoGen: Compositional Human Motion Generation with Energy-Based Diffusion Model in Latent Space

Stealthy Backdoor Attack in Self-Supervised Learning Vision Encoders for Large Vision Language Models

Pose-Guided Temporal Enhancement for Robust Low-Resolution Hand Reconstruction

ReDiffDet: Rotation-equivariant Diffusion Model for Oriented Object Detection

MMTL-UniAD: A Unified Framework for Multimodal and Multi-Task Learning in Assistive Driving Perception

PosterO: Structuring Layout Trees to Enable Language Models in Generalized Content-Aware Layout Generation

Tiled Diffusion

BIOMEDICA: An Open Biomedical Image-Caption Archive, Dataset, and Vision-Language Models Derived from Scientific Literature

UniGoal: Towards Universal Zero-shot Goal-oriented Navigation

Structure-Aware Correspondence Learning for Relative Pose Estimation

LoRA Recycle: Unlocking Tuning-Free Few-Shot Adaptability in Visual Foundation Models by Recycling Pre-Tuned LoRAs

PyTorchGeoNodes: Enabling Differentiable Shape Programs for 3D Shape Reconstruction

STING-BEE: Towards Vision-Language Model for Real-World X-ray Baggage Security Inspection

SCAP: Transductive Test-Time Adaptation via Supportive Clique-based Attribute Prompting

Contextual AD Narration with Interleaved Multimodal Sequence

FIFA: Fine-grained Inter-frame Attention for Driver's Video Gaze Estimation

MNE-SLAM: Multi-Agent Neural SLAM for Mobile Robots

RigGS: Rigging of 3D Gaussians for Modeling Articulated Objects in Videos

TensoFlow: Tensorial Flow-based Sampler for Inverse Rendering

FRAMES-VQA: Benchmarking Fine-Tuning Robustness across Multi-Modal Shifts in Visual Question Answering

LSceneLLM: Enhancing Large 3D Scene Understanding Using Adaptive Visual Preferences

Shape Abstraction via Marching Differentiable Support Functions

Event Fields: Capturing Light Fields at High Speed, Resolution, and Dynamic Range

HyperFree: A Channel-adaptive and Tuning-free Foundation Model for Hyperspectral Remote Sensing Imagery

Generative Inbetweening through Frame-wise Conditions-Driven Video Generation

Exploring Temporally-Aware Features for Point Tracking

Tracktention: Leveraging Point Tracking to Attend Videos Faster and Better

GBlobs: Explicit Local Structure via Gaussian Blobs for Improved Cross-Domain LiDAR-based 3D Object Detection

Detail-Preserving Latent Diffusion for Stable Shadow Removal

Scaling Down Text Encoders of Text-to-Image Diffusion Models

Floating No More: Object-Ground Reconstruction from a Single Image

CrossOver: 3D Scene Cross-Modal Alignment

SKE-Layout: Spatial Knowledge Enhanced Layout Generation with LLMs

Gaussian Eigen Models for Human Heads

Pattern Analogies: Learning to Perform Programmatic Image Edits by Analogy

4D-Fly: Fast 4D Reconstruction from a Single Monocular Video

STAR-Edge: Structure-aware Local Spherical Curve Representation for Thin-walled Edge Extraction from Unstructured Point Clouds

Complementary Advantages: Exploiting Cross-Field Frequency Correlation for NIR-Assisted Image Denoising

Boosting the Dual-Stream Architecture in Ultra-High Resolution Segmentation with Resolution-Biased Uncertainty Estimation

DiffLO: Semantic-Aware LiDAR Odometry with Diffusion-Based Refinement

pFedMxF: Personalized Federated Class-Incremental Learning with Mixture of Frequency Aggregation

Style-Editor: Text-driven Object-centric Style Editing

Transfer Your Perspective: Controllable 3D Generation from Any Viewpoint in a Driving Scene

Efficient Transfer Learning for Video-language Foundation Models

Radio Frequency Ray Tracing with Neural Object Representation for Enhanced RF Modeling

ANNEXE: Unified Analyzing, Answering, and Pixel Grounding for Egocentric Interaction

MET3R: Measuring Multi-View Consistency in Generated Images

Segmenting Maxillofacial Structures in CBCT Volumes

3D Dental Model Segmentation with Geometrical Boundary Preserving

Neuro-3D: Towards 3D Visual Decoding from EEG Signals

FastVLM: Efficient Vision Encoding for Vision Language Models

VISTA3D: A Unified Segmentation Foundation Model For 3D Medical Imaging

The Art of Deception: Color Visual Illusions and Diffusion Models

Progressive Rendering Distillation: Adapting Stable Diffusion for Instant Text-to-Mesh Generation without 3D Data

Derivative-Free Diffusion Manifold-Constrained Gradient for Unified XAI

ZoomLDM: Latent Diffusion Model for Multi-scale Image Generation

Do Computer Vision Foundation Models Learn the Low-level Characteristics of the Human Visual System?

GaussianUDF: Inferring Unsigned Distance Functions through 3D Gaussian Splatting

Towards RAW Object Detection in Diverse Conditions

FLAME: Frozen Large Language Models Enable Data-Efficient Language-Image Pre-training

DV-Matcher: Deformation-based Non-rigid Point Cloud Matching Guided by Pre-trained Visual Features

Reasoning Mamba: Hypergraph-Guided Region Relation Calculating for Weakly Supervised Affordance Grounding

Adapter Merging with Centroid Prototype Mapping for Scalable Class-Incremental Learning

OpenSDI: Spotting Diffusion-Generated Images in the Open World

Rethinking Correspondence-based Category-Level Object Pose Estimation

Adaptive Part Learning for Fine-Grained Generalized Category Discovery: A Plug-and-Play Enhancement

Lux Post Facto: Learning Portrait Performance Relighting with Conditional Video Diffusion and a Hybrid Dataset

DiG: Scalable and Efficient Diffusion Models with Gated Linear Attention

Monocular and Generalizable Gaussian Talking Head Animation

Rethinking Token Reduction with Parameter-Efficient Fine-Tuning in ViT for Pixel-Level Tasks

SVDC: Consistent Direct Time-of-Flight Video Depth Completion with Frequency Selective Fusion

Locally Orderless Images for Optimization in Differentiable Rendering

Rethinking Training for De-biasing Text-to-Image Generation: Unlocking the Potential of Stable Diffusion

FLAIR: VLM with Fine-grained Language-informed Image Representations

GG-SSMs: Graph-Generating State Space Models

STDD: Spatio-Temporal Dual Diffusion for Video Generation

Continuous Adverse Weather Removal via Degradation-Aware Distillation

Fine-Grained Erasure in Text-to-Image Diffusion-based Foundation Models

ILIAS: Instance-Level Image retrieval At Scale

Exploiting Temporal State Space Sharing for Video Semantic Segmentation

DeRS: Towards Extremely Efficient Upcycled Mixture-of-Experts Models

GeoDepth: From Point-to-Depth to Plane-to-Depth Modeling for Self-Supervised Monocular Depth Estimation

SSHNet: Unsupervised Cross-modal Homography Estimation via Problem Reformulation and Split Optimization

High-fidelity 3D Object Generation from Single Image with RGBN-Volume Gaussian Reconstruction Model

Optimal Transport-Guided Source-Free Adaptation for Face Anti-Spoofing

Robust 3D Shape Reconstruction in Zero-Shot from a Single Image in the Wild

BOE-ViT: Boosting Orientation Estimation with Equivariance in Self-Supervised 3D Subtomogram Alignment

Holmes-VAU: Towards Long-term Video Anomaly Understanding at Any Granularity

Adventurer: Optimizing Vision Mamba Architecture Designs for Efficiency

QuartDepth: Post-Training Quantization for Real-Time Depth Estimation on the Edge

Sufficient Invariant Learning for Distribution Shift

DirectTriGS: Triplane-based Gaussian Splatting Field Representation for 3D Generation

Scene4U: Hierarchical Layered 3D Scene Reconstruction from Single Panoramic Image for Your Immerse Exploration

Make-It-Animatable: An Efficient Framework for Authoring Animation-Ready 3D Characters

IterIS: Iterative Inference-Solving Alignment for LoRA Merging

ACAttack: Adaptive Cross Attacking RGB-T Tracker via Multi-Modal Response Decoupling

DeCafNet: Delegate and Conquer for Efficient Temporal Grounding in Long Videos

Efficient ANN-Guided Distillation: Aligning Rate-based Features of Spiking Neural Networks through Hybrid Block-wise Replacement

Subspace Constraint and Contribution Estimation for Heterogeneous Federated Learning

SmartEraser: Remove Anything from Images using Masked-Region Guidance

Sample- and Parameter-Efficient Auto-Regressive Image Models

LOCORE: Image Re-ranking with Long-Context Sequence Modeling

NeRFPrior: Learning Neural Radiance Field as a Prior for Indoor Scene Reconstruction

Emphasizing Discriminative Features for Dataset Distillation in Complex Scenarios

BiLoRA: Almost-Orthogonal Parameter Spaces for Continual Learning

Vid2Sim: Generalizable, Video-based Reconstruction of Appearance, Geometry and Physics for Mesh-free Simulation

SceneTAP: Scene-Coherent Typographic Adversarial Planner against Vision-Language Models in Real-World Environments

Collaborative Decoding Makes Visual Auto-Regressive Modeling Efficient

AerialMegaDepth: Learning Aerial-Ground Reconstruction and View Synthesis

Towards Training-free Anomaly Detection with Vision and Language Foundation Models

LiVOS: Light Video Object Segmentation with Gated Linear Matching

Dynamic Content Prediction with Motion-aware Priors for Blind Face Video Restoration

Omni-RGPT: Unifying Image and Video Region-level Understanding via Token Marks

Polarized Color Screen Matting

Visual Representation Learning through Causal Intervention for Controllable Image Editing

Exploring the Deep Fusion of Large Language Models and Diffusion Transformers for Text-to-Image Synthesis

Exploring Sparse MoE in GANs for Text-conditioned Image Synthesis

Deformable Radial Kernel Splatting

Bayesian Prompt Flow Learning for Zero-Shot Anomaly Detection

Post-pre-training for Modality Alignment in Vision-Language Foundation Models

Efficient Event-Based Object Detection: A Hybrid Neural Network with Spatial and Temporal Attention

SynthLight: Portrait Relighting with Diffusion Model by Learning to Re-render Synthetic Faces

HUNet: Homotopy Unfolding Network for Image Compressive Sensing

HalLoc: Token-level Localization of Hallucinations for Vision Language Models

DiffPortrait360: Consistent Portrait Diffusion for 360 View Synthesis

SemiETS: Integrating Spatial and Content Consistencies for Semi-Supervised End-to-end Text Spotting

From Slow Bidirectional to Fast Autoregressive Video Diffusion Models

PassionSR: Post-Training Quantization with Adaptive Scale in One-Step Diffusion based Image Super-Resolution

RainyGS: Efficient Rain Synthesis with Physically-Based Gaussian Splatting

Noise Diffusion for Enhancing Semantic Faithfulness in Text-to-Image Synthesis

MonoInstance: Enhancing Monocular Priors via Multi-view Instance Alignment for Neural Rendering and Reconstruction

Three-view Focal Length Recovery From Homographies

RAP: Retrieval-Augmented Personalization for Multimodal Large Language Models

FADA: Fast Diffusion Avatar Synthesis with Mixed-Supervised Multi-CFG Distillation

CAT4D: Create Anything in 4D with Multi-View Video Diffusion Models

Exploring Semantic Feature Discrimination for Perceptual Image Super-Resolution and Opinion-Unaware No-Reference Image Quality Assessment

Distilling Long-tailed Datasets

Gaze-LLE: Gaze Target Estimation via Large-Scale Learned Encoders

Distilling Spectral Graph for Object-Context Aware Open-Vocabulary Semantic Segmentation

Incorporating Dense Knowledge Alignment into Unified Multimodal Representation Models

Geometry Field Splatting with Gaussian Surfels

Stereo4D: Learning How Things Move in 3D from Internet Stereo Videos

PS-EIP: Robust Photometric Stereo Based on Event Interval Profile

GenPC: Zero-shot Point Cloud Completion via 3D Generative Priors

FoundHand: Large-Scale Domain-Specific Learning for Controllable Hand Image Generation

Unified Uncertainty-Aware Diffusion for Multi-Agent Trajectory Modeling

WeGen: A Unified Model for Interactive Multimodal Generation as We Chat

HRAvatar: High-Quality and Relightable Gaussian Head Avatar

Latent Drifting in Diffusion Models for Counterfactual Medical Image Synthesis

Rethinking Spiking Self-Attention Mechanism: Implementing α-XNOR Similarity Calculation in Spiking Transformers

MagicQuill: An Intelligent Interactive Image Editing System

HeMoRa: Unsupervised Heuristic Consensus Sampling for Robust Point Cloud Registration

Reducing Class-wise Confusion for Incremental Learning with Disentangled Manifolds

Open-Vocabulary Functional 3D Scene Graphs for Real-World Indoor Spaces

Boosting Adversarial Transferability through Augmentation in Hypothesis Space

AniMo: Species-Aware Model for Text-Driven Animal Motion Generation

EditAR: Unified Conditional Generation with Autoregressive Models

Instance-wise Supervision-level Optimization in Active Learning

ViiNeuS: Volumetric Initialization for Implicit Neural Surface Reconstruction of Urban Scenes with Limited Image Overlap

Model Diagnosis and Correction via Linguistic and Implicit Attribute Editing

STAA-SNN: Spatial-Temporal Attention Aggregator for Spiking Neural Networks

Knowledge Memorization and Rumination for Pre-trained Model-based Class-Incremental Learning

A Distractor-Aware Memory for Visual Object Tracking with SAM2

Activating Sparse Part Concepts for 3D Class Incremental Learning

ProxyTransformation: Preshaping Point Cloud Manifold With Proxy Attention For 3D Visual Grounding

BFANet: Revisiting 3D Semantic Segmentation with Boundary Feature Analysis

Stable Flow: Vital Layers for Training-Free Image Editing

Video-ColBERT: Contextualized Late Interaction for Text-to-Video Retrieval

Beyond Words: Augmenting Discriminative Richness via Diffusions in Unsupervised Prompt Learning

Unlocking the Potential of Unlabeled Data in Semi-Supervised Domain Generalization

TokenMotion: Decoupled Motion Control via Token Disentanglement for Human-centric Video Generation

CholecTrack20: A Multi-Perspective Tracking Dataset for Surgical Tools

Visual and Semantic Prompt Collaboration for Generalized Zero-Shot Learning

Neural LightRig: Unlocking Accurate Object Normal and Material Estimation with Multi-Light Diffusion

VidMuse: A Simple Video-to-Music Generation Framework with Long-Short-Term Modeling

Human-centered Interactive Learning via MLLMs for Text-to-Image Person Re-identification

Conditional Balance: Improving Multi-Conditioning Trade-Offs in Image Generation

KeyFace: Expressive Audio-Driven Facial Animation for Long Sequences via KeyFrame Interpolation

Mitigating Ambiguities in 3D Classification with Gaussian Splatting

Exposure-slot: Exposure-centric Representations Learning with Slot-in-Slot Attention for Region-aware Exposure Correction

Data-Free Group-Wise Fully Quantized Winograd Convolution via Learnable Scales

EdgeDiff: Edge-aware Diffusion Network for Building Reconstruction from Point Clouds

GEN3C: 3D-Informed World-Consistent Video Generation with Precise Camera Control

A Dataset for Semantic Segmentation in the Presence of Unknowns

HierarQ: Task-Aware Hierarchical Q-Former for Enhanced Video Understanding

DeNVeR: Deformable Neural Vessel Representations for Unsupervised Video Vessel Segmentation

DH-Set: Improving Vision-Language Alignment with Diverse and Hybrid Set-Embeddings Learning

Task-Aware Clustering for Prompting Vision-Language Models

CASP: Compression of Large Multimodal Models Based on Attention Sparsity

Towards Cost-Effective Learning: A Synergy of Semi-Supervised and Active Learning

Unveil Inversion and Invariance in Flow Transformer for Versatile Image Editing

Light Transport-aware Diffusion Posterior Sampling for Single-View Reconstruction of 3D Volumes

DyCON: Dynamic Uncertainty-aware Consistency and Contrastive Learning for Semi-supervised Medical Image Segmentation

STiL: Semi-supervised Tabular-Image Learning for Comprehensive Task-Relevant Information Exploration in Multimodal Classification

DUNE: Distilling a Universal Encoder from Heterogeneous 2D and 3D Teachers

HiRes-LLaVA: Restoring Fragmentation Input in High-Resolution Large Vision-Language Models

Generative Photography: Scene-Consistent Camera Control for Realistic Text-to-Image Synthesis

SeqMvRL: A Sequential Fusion Framework for Multi-view Representation Learning

Auto Cherry-Picker: Learning from High-quality Generative Data Driven by Language

EnvGS: Modeling View-Dependent Appearance with Environment Gaussian

BadToken: Token-level Backdoor Attacks to Multi-modal Large Language Models

NeISF++: Neural Incident Stokes Field for Polarized Inverse Rendering of Conductors and Dielectrics

Flexible Group Count Enables Hassle-Free Structured Pruning

Hunyuan-Portrait: Implicit Condition Control for Enhanced Portrait Animation

MeshArt: Generating Articulated Meshes with Structure-Guided Transformers

Non-Natural Image Understanding with Advancing Frequency-based Vision Encoders

Generative Multimodal Pretraining with Discrete Diffusion Timestep Tokens

SplatFlow: Self-Supervised Dynamic Gaussian Splatting in Neural Motion Flow Field for Autonomous Driving

Zero-shot 3D Question Answering via Voxel-based Dynamic Token Compression

AesthetiQ: Enhancing Graphic Layout Design via Aesthetic-Aware Preference Alignment of Multi-modal Large Language Models

Enhanced then Progressive Fusion with View Graph for Multi-View Clustering

FINECAPTION: Compositional Image Captioning Focusing on Wherever You Want at Any Granularity

Adaptive Non-Uniform Timestep Sampling for Accelerating Diffusion Model Training

Explainable Saliency: Articulating Reasoning with Contextual Prioritization

Navigating the Unseen: Zero-shot Scene Graph Generation via Capsule-Based Equivariant Features

AnyEdit: Mastering Unified High-Quality Image Editing for Any Idea

Compass Control: Multi Object Orientation Control for Text-to-Image Generation

Correcting Deviations from Normality: A Reformulated Diffusion Model for Multi-Class Unsupervised Anomaly Detection

Continuous 3D Perception Model with Persistent State

LP-Diff: Towards Improved Restoration of Real-World Degraded License Plate

Unleashing the Potential of Consistency Learning for Detecting and Grounding Multi-Modal Media Manipulation

DNF: Unconditional 4D Generation with Dictionary-based Neural Fields

ARM: Appearance Reconstruction Model for Relightable 3D Generation

FilmComposer: LLM-Driven Music Production for Silent Film Clips

Ground-V: Teaching VLMs to Ground Complex Instructions in Pixels

Structure-from-Motion with a Non-Parametric Camera Model

EventPSR: Surface Normal and Reflectance Estimation from Photometric Stereo Using an Event Camera

LAL: Enhancing 3D Human Motion Prediction with Latency-aware Auxiliary Learning

CASP: Consistency-aware Audio-induced Saliency Prediction Model for Omnidirectional Video

RefPose: Leveraging Reference Geometric Correspondences for Accurate 6D Pose Estimation of Unseen Objects

Shape My Moves: Text-Driven Shape-Aware Synthesis of Human Motions

Generating 3D-Consistent Videos from Unposed Internet Photos

Gazing at Rewards: Eye Movements as a Lens into Human and AI Decision-Making in Hybrid Visual Foraging

FOCUS: Knowledge-enhanced Adaptive Visual Compression for Few-shot Whole Slide Image Classification

Beyond Human Perception: Understanding Multi-Object World from Monocular View

GRAE-3DMOT: Geometry Relation-Aware Encoder for Online 3D Multi-Object Tracking

Automatic Joint Structured Pruning and Quantization for Efficient Neural Network Training and Compression

Parameter Efficient Mamba Tuning via Projector-targeted Diagonal-centric Linear Transformation

ViUniT: Visual Unit Tests for More Robust Visual Programming

LIRM: Large Inverse Rendering Model for Progressive Reconstruction of Shape, Materials and View-dependent Radiance Fields

DualTalk: Dual-Speaker Interaction for 3D Talking Head Conversations

MAR-3D: Progressive Masked Auto-regressor for High-Resolution 3D Generation

beta-FFT: Nonlinear Interpolation and Differentiated Training Strategies for Semi-Supervised Medical Image Segmentation

Dynamic Group Normalization: Spatio-Temporal Adaptation to Evolving Data Statistics

Latent Space Super-Resolution for Higher-Resolution Image Generation with Diffusion Models

Reasoning in Visual Navigation of End-to-end Trained Agents: A Dynamical Systems Approach

DFormerv2: Geometry Self-Attention for RGBD Semantic Segmentation

GarmentPile: Point-Level Visual Affordance Guided Retrieval and Adaptation for Cluttered Garments Manipulation

GroomLight: Hybrid Inverse Rendering for Relightable Human Hair Appearance Modeling

Improving Editability in Image Generation with Layer-wise Memory

Sea-ing in Low-light

CL-LoRA: Continual Low-Rank Adaptation for Rehearsal-Free Class-Incremental Learning

Generative Modeling of Class Probability for Multi-Modal Representation Learning

VisionZip: Longer is Better but Not Necessary in Vision Language Models

Simplification Is All You Need against Out-of-Distribution Overconfidence

LOD-GS: Achieving Levels of Detail using Scalable Gaussian Soup

VoteFlow: Enforcing Local Rigidity in Self-Supervised Scene Flow

The Devil is in Low-Level Features for Cross-Domain Few-Shot Segmentation

Design2GarmentCode: Turning Design Concepts to Tangible Garments Through Program Synthesis

Uncertainty Weighted Gradients for Model Calibration

Efficient Dynamic Scene Editing via 4D Gaussian-based Static-Dynamic Separation

SocialMOIF: Multi-Order Intention Fusion for Pedestrian Trajectory Prediction

FFaceNeRF: Few-shot Face Editing in Neural Radiance Fields

Discrete to Continuous: Generating Smooth Transition Poses from Sign Language Observations

HistoFS: Non-IID Histopathologic Whole Slide Image Classification via Federated Style Transfer with RoI-Preserving

Unified Medical Lesion Segmentation via Self-referring Indicator

SGSST: Scaling Gaussian Splatting Style Transfer

Learning Bijective Surface Parameterization for Inferring Signed Distance Functions from Sparse Point Clouds with Grid Deformation

Layer- and Timestep-Adaptive Differentiable Token Compression Ratios for Efficient Diffusion Transformers

Zero-shot RGB-D Point Cloud Registration with Pre-trained Large Vision Model

Balancing Two Classifiers via A Simplex ETF Structure for Model Calibration

DAMM-Diffusion: Learning Divergence-Aware Multi-Modal Diffusion Model for Nanoparticles Distribution Prediction

VL-RewardBench: A Challenging Benchmark for Vision-Language Generative Reward Models

Unveiling Differences in Generative Models: A Scalable Differential Clustering Approach

CL-MoE: Enhancing Multimodal Large Language Model with Dual Momentum Mixture-of-Experts for Continual Visual Question Answering

U-Know-DiffPAN: An Uncertainty-aware Knowledge Distillation Diffusion Framework with Details Enhancement for PAN-Sharpening

SceneDiffuser++: City-Scale Traffic Simulation via a Generative World Model

Gaussian Splatting Feature Fields for (Privacy-Preserving) Visual Localization

Gazing Into Missteps: Leveraging Eye-Gaze for Unsupervised Mistake Detection in Egocentric Videos of Skilled Human Activities

Recover and Match: Open-Vocabulary Multi-Label Recognition through Knowledge-Constrained Optimal Transport

RelationField: Relate Anything in Radiance Fields

DyFo: A Training-Free Dynamic Focus Visual Search for Enhancing LMMs in Fine-Grained Visual Understanding

From Head to Tail: Towards Balanced Representation in Large Vision-Language Models through Adaptive Data Calibration

Let Humanoids Hike! Integrative Skill Development on Complex Trails

DEIM: DETR with Improved Matching for Fast Convergence

BF-STVSR: B-Splines and Fourier - Best Friends for High Fidelity Spatial-Temporal Video Super-Resolution

DIO: Decomposable Implicit 4D Occupancy-Flow World Model

A Flag Decomposition for Hierarchical Datasets

RCP-Bench: Benchmarking Robustness for Collaborative Perception Under Diverse Corruptions

Olympus: A Universal Task Router for Computer Vision Tasks

HERA: Hybrid Explicit Representation for Ultra-Realistic Head Avatars

Circumventing Shortcuts in Audio-visual Deepfake Detection Datasets with Unsupervised Learning

CoE: Chain-of-Explanation via Automatic Visual Concept Circuit Description and Polysemanticity Quantification

Ego4o: Egocentric Human Motion Capture and Understanding from Multi-Modal Input

Image Over Text: Transforming Formula Recognition Evaluation with Character Detection Matching

FreePCA: Integrating Consistency Information across Long-short Frames in Training-free Long Video Generation via Principal Component Analysis

Hierarchical Adaptive Filtering Network for Text Image Specular Highlight Removal

Improving Semi-Supervised Semantic Segmentation with Sliced-Wasserstein Feature Alignment and Uniformity

Mind the Time: Temporally-Controlled Multi-Event Video Generation

Learning Extremely High Density Crowds as Active Matters

Audio-Visual Semantic Graph Network for Audio-Visual Event Localization

3D-Mem: 3D Scene Memory for Embodied Exploration and Reasoning

EchoMimicV2: Towards Striking, Simplified, and Semi-Body Human Animation

Navigation World Models

Video Motion Transfer with Diffusion Transformers

Gaussian Splatting for Efficient Satellite Image Photogrammetry

Unified Reconstruction of Static and Dynamic Scenes from Events

Automatic Spectral Calibration of Hyperspectral Images: Method, Dataset and Benchmark

Conformal Prediction and MLLM aided Uncertainty Quantification in Scene Graph Generation

Point-to-Region Loss for Semi-Supervised Point-Based Crowd Counting

Parallel Sequence Modeling via Generalized Spatial Propagation Network

Scenario Dreamer: Vectorized Latent Diffusion for Generating Driving Simulation Environments

NADER: Neural Architecture Design via Multi-Agent Collaboration

Move-in-2D: 2D-Conditioned Human Motion Generation

PoseBH: Prototypical Multi-Dataset Training Beyond Human Pose Estimation

MATCHA: Towards Matching Anything

Separation of Powers: On Segregating Knowledge from Observation in LLM-enabled Knowledge-based Visual Question Answering

Decision SpikeFormer: Spike-Driven Transformer for Decision Making

SF2T: Self-supervised Fragment Finetuning of Video-LLMs for Fine-Grained Understanding

Theory-Inspired Deep Multi-View Multi-Label Learning with Incomplete Views and Noisy Labels

Fitted Neural Lossless Image Compression

EMOE: Modality-Specific Enhanced Dynamic Emotion Experts

JarvisIR: Elevating Autonomous Driving Perception with Intelligent Image Restoration

UniPre3D: Unified Pre-training of 3D Point Cloud Models with Cross-Modal Gaussian Splatting

EntityErasure: Erasing Entity Cleanly via Amodal Entity Segmentation and Completion

Mosaic3D: Foundation Dataset and Model for Open-Vocabulary 3D Segmentation

T-CIL: Temperature Scaling using Adversarial Perturbation for Calibration in Class-Incremental Learning

LoRA Subtraction for Drift-Resistant Space in Exemplar-Free Continual Learning

Joint Out-of-Distribution Filtering and Data Discovery Active Learning

Finding Local Diffusion Schrödinger Bridge using Kolmogorov-Arnold Network

CorrBEV: Multi-View 3D Object Detection by Correlation Learning with Multi-modal Prototypes

CATANet: Efficient Content-Aware Token Aggregation for Lightweight Image Super-Resolution

SeeGround: See and Ground for Zero-Shot Open-Vocabulary 3D Visual Grounding

Linear Attention Modeling for Learned Image Compression

Around the World in 80 Timesteps: A Generative Approach to Global Visual Geolocation

Asynchronous Collaborative Graph Representation for Frames and Events

Real-time High-fidelity Gaussian Human Avatars with Position-based Interpolation of Spatially Distributed MLPs

ReconDreamer: Crafting World Models for Driving Scene Reconstruction via Online Restoration

Self-Supervised Large Scale Point Cloud Completion for Archaeological Site Restoration

Chain of Attack: On the Robustness of Vision-Language Models Against Transfer-Based Adversarial Attacks

Thin-Shell-SfT: Fine-Grained Monocular Non-rigid 3D Surface Tracking with Neural Deformation Fields

DeCLIP: Decoupled Learning for Open-Vocabulary Dense Perception

SocialGesture: Delving into Multi-person Gesture Understanding

The PanAf-FGBG Dataset: Understanding the Impact of Backgrounds in Wildlife Behaviour Recognition

Multi-modal Topology-embedded Graph Learning for Spatially Resolved Genes Prediction from Pathology Images with Prior Gene Similarity Information

Question-Aware Gaussian Experts for Audio-Visual Question Answering

Multitwine: Multi-Object Compositing with Text and Layout Control

Adaptive Rectangular Convolution for Remote Sensing Pansharpening

Video Depth without Video Models

PointLoRA: Low-Rank Adaptation with Token Selection for Point Cloud Learning

HumanRig: Learning Automatic Rigging for Humanoid Character in a Large Scale Dataset

UIBDiffusion: Universal Imperceptible Backdoor Attack for Diffusion Models

DiskVPS: Vanishing Point Detector via Hough Transform in a Disk Region

Seeing Far and Clearly: Mitigating Hallucinations in MLLMs with Attention Causal Decoding

Towards Autonomous Micromobility through Scalable Urban Simulation

FisherTune: Fisher-Guided Robust Tuning of Vision Foundation Models for Domain Generalized Segmentation

Language-Assisted Debiasing and Smoothing for Foundation Model-Based Semi-Supervised Learning

Unbiased Video Scene Graph Generation via Visual and Semantic Dual Debiasing

Channel-wise Noise Scheduled Diffusion for Inverse Rendering in Indoor Scenes

Targeted Forgetting of Image Subgroups in CLIP Models

Unleashing In-context Learning of Autoregressive Models for Few-shot Image Manipulation

Harnessing Frozen Unimodal Encoders for Flexible Multimodal Alignment

Enhancing Diversity for Data-free Quantization

SeqAfford: Sequential 3D Affordance Reasoning via Multimodal Large Language Model

DSV-LFS: Unifying LLM-Driven Semantic Cues with Visual Features for Robust Few-Shot Segmentation

Revisiting Generative Replay for Class Incremental Object Detection

Bridging Viewpoint Gaps: Geometric Reasoning Boosts Semantic Correspondence

DPU: Dynamic Prototype Updating for Multimodal Out-of-Distribution Detection

Towards Unbiased and Robust Spatio-Temporal Scene Graph Generation and Anticipation

Spatial Transport Optimization by Repositioning Attention Map for Training-Free Text-to-Image Synthesis

Prompt2Perturb (P2P): Text-Guided Diffusion-Based Adversarial Attack on Breast Ultrasound Images

Boosting Domain Incremental Learning: Selecting the Optimal Parameters is All You Need

COAP: Memory-Efficient Training with Correlation-Aware Gradient Projection

FaceBench: A Multi-View Multi-Level Facial Attribute VQA Dataset for Benchmarking Face Perception MLLMs

EffiDec3D: An Optimized Decoder for High-Performance and Efficient 3D Medical Image Segmentation

ArtiScene: Language-Driven Artistic 3D Scene Generation Through Image Intermediary

MobileH2R: Learning Generalizable Human to Mobile Robot Handover Exclusively from Scalable and Diverse Synthetic Data

Improving Sound Source Localization with Joint Slot Attention on Image and Audio

Improved Monocular Depth Prediction Using Distance Transform Over Pre-semantic Contours with Self-supervised Neural Networks

Feature-Preserving Mesh Decimation for Normal Integration

Is this Generated Person Existed in Real-world? Fine-grained Detecting and Calibrating Abnormal Human-body

Automated Generation of Challenging Multiple-Choice Questions for Vision Language Model Evaluation

DexHandDiff: Interaction-aware Diffusion Planning for Adaptive Dexterous Manipulation

VerbDiff: Text-Only Diffusion Models with Enhanced Interaction Awareness

ROLL: Robust Noisy Pseudo-label Learning for Multi-View Clustering with Noisy Correspondence

Memories of Forgotten Concepts

PQPP: A Joint Benchmark for Text-to-Image Prompt and Query Performance Prediction

CheXwhatsApp: A Dataset for Exploring Challenges in the Diagnosis of Chest X-rays through Mobile Devices

Degradation-Aware Feature Perturbation for All-in-One Image Restoration

ACL: Activating Capability of Linear Attention for Image Restoration

GenDeg: Diffusion-based Degradation Synthesis for Generalizable All-In-One Image Restoration

Phoenix: A Motion-based Self-Reflection Framework for Fine-grained Robotic Action Correction

The Power of Context: How Multimodality Improves Image Super-Resolution

MARBLE: Material Recomposition and Blending in CLIP-Space

EventFly: Event Camera Perception from Ground to the Sky

Detect Any Mirrors: Boosting Learning Reliability on Large-Scale Unlabeled Data with an Iterative Data Engine

CH3Depth: Efficient and Flexible Depth Foundation Model with Flow Matching

Efficient Visual State Space Model for Image Deblurring

4D LangSplat: 4D Language Gaussian Splatting via Multimodal Large Language Models

MambaVLT: Time-Evolving Multimodal State Space Model for Vision-Language Tracking

Enhancing 3D Gaze Estimation in the Wild using Weak Supervision with Gaze Following Labels

MotionMap: Representing Multimodality in Human Pose Forecasting

Learning-enabled Polynomial Lyapunov Function Synthesis via High-Accuracy Counterexample-Guided Framework

Factored-NeuS: Reconstructing Surfaces, Illumination, and Materials of Possibly Glossy Objects

GaussianSpa: An “Optimizing-Sparsifying” Simplification Framework for Compact and High-Quality 3D Gaussian Splatting

Sparse2DGS: Geometry-Prioritized Gaussian Splatting for Surface Reconstruction from Sparse Views

VinTAGe: Joint Video and Text Conditioning for Holistic Audio Generation

ASHiTA: Automatic Scene-grounded HIerarchical Task Analysis

Patient-Level Anatomy Meets Scanning-Level Physics: Personalized Federated Low-Dose CT Denoising Empowered by Large Language Model

Exploiting Deblurring Networks for Radiance Fields

Rethinking Lanes and Points in Complex Scenarios for Monocular 3D Lane Detection

SPAR3D: Stable Point-Aware Reconstruction of 3D Objects from Single Images

Discovering Fine-Grained Visual-Concept Relations by Disentangled Optimal Transport Concept Bottleneck Models

Focus-N-Fix: Region-Aware Fine-Tuning for Text-to-Image Generation

RoomTour3D: Geometry-Aware Video-Instruction Tuning for Embodied Navigation

PSA-SSL: Pose and Size-aware Self-Supervised Learning on LiDAR Point Clouds

Bringing CLIP to the Clinic: Dynamic Soft Labels and Negation-Aware Learning for Medical Analysis

Label Shift Meets Online Learning: Ensuring Consistent Adaptation with Universal Dynamic Regret

A Physics-Informed Blur Learning Framework for Imaging Systems

A Semantic Knowledge Complementarity based Decoupling Framework for Semi-supervised Class-imbalanced Medical Image Segmentation

ModeSeq: Taming Sparse Multimodal Motion Prediction with Sequential Mode Modeling

Quaffure: Real-Time Quasi-Static Neural Hair Simulation

Towards Practical Real-Time Neural Video Compression

From Alexnet to Transformers: Measuring the Non-linearity of Deep Neural Networks with Affine Optimal Transport

DepthSplat: Connecting Gaussian Splatting and Depth

FedBiP: Heterogeneous One-Shot Federated Learning with Personalized Latent Diffusion Models

Dynamic Camera Poses and Where to Find Them

GCE-Pose: Global Context Enhancement for Category-level Object Pose Estimation

OmniGen: Unified Image Generation

QuCOOP: A Versatile Framework for Solving Composite and Binary-Parametrised Problems on Quantum Annealers

Calibrated Multi-Preference Optimization for Aligning Diffusion Models

Learning from Neighbors: Category Extrapolation for Long-Tail Learning

Material Anything: Generating Materials for Any 3D Object via Diffusion

TokenHSI: Unified Synthesis of Physical Human-Scene Interactions through Task Tokenization

ImagineFSL: Self-Supervised Pretraining Matters on Imagined Base Set for VLM-based Few-shot Learning

Project-Probe-Aggregate: Efficient Fine-Tuning for Group Robustness

Implicit Bias Injection Attacks against Text-to-Image Diffusion Models

ROICtrl: Boosting Instance Control for Visual Generation

FRESA: Feedforward Reconstruction of Personalized Skinned Avatars from Few Images

ReasonGrounder: LVLM-Guided Hierarchical Feature Splatting for Open-Vocabulary 3D Visual Grounding and Reasoning

Cropper: Vision-Language Model for Image Cropping through In-Context Learning

Advancing Adversarial Robustness in GNeRFs: The IL2-NeRF Attack

WonderWorld: Interactive 3D Scene Generation from a Single Image

A Lightweight UDF Learning Framework for 3D Reconstruction Based on Local Shape Functions

DiffCAM: Data-Driven Saliency Maps by Capturing Feature Differences

PolarNeXt: Rethink Instance Segmentation with Polar Representation

Imagine and Seek: Improving Composed Image Retrieval with an Imagined Proxy

EMOVA: Empowering Language Models to See, Hear and Speak with Vivid Emotions

SAM-REF: Introducing Image-Prompt Synergy during Interaction for Detail Enhancement in the Segment Anything Model

DarkIR: Robust Low-Light Image Restoration

R2C: Mapping Room to Chessboard to Unlock LLM As Low-Level Action Planner

ICE: Intrinsic Concept Extraction from a Single Image via Diffusion Models

ASIGN: An Anatomy-aware Spatial Imputation Graphic Network for 3D Spatial Transcriptomics

Reversing Flow for Image Restoration

Any3DIS: Class-Agnostic 3D Instance Segmentation by 2D Mask Tracking

FDS: Frequency-Aware Denoising Score for Text-Guided Latent Diffusion Image Editing

ROS-SAM: High-Quality Interactive Segmentation for Remote Sensing Moving Object

MultiMorph: On-demand Atlas Construction

From Prototypes to General Distributions: An Efficient Curriculum for Masked Image Modeling

Synthetic Visual Genome

Difference Inversion: Interpolate and Isolate the Difference with Token Consistency for Image Analogy Generation

Octopus: Alleviating Hallucination via Dynamic Contrastive Decoding

ShiftwiseConv: Small Convolutional Kernel with Large Kernel Effect

RealEdit: Reddit Edits As a Large-scale Empirical Dataset for Image Transformations

MTADiffusion: Mask Text Alignment Diffusion Model for Object Inpainting

Can Generative Video Models Help Pose Estimation?

DreamOmni: Unified Image Generation and Editing

Towards Natural Language-Based Document Image Retrieval: New Dataset and Benchmark

PARC: A Quantitative Framework Uncovering the Symmetries within Vision Language Models

DAGSM: Disentangled Avatar Generation with GS-enhanced Mesh

G3Flow: Generative 3D Semantic Flow for Pose-aware and Generalizable Object Manipulation

InsightEdit: Towards Better Instruction Following for Image Editing

Open-Canopy: Towards Very High Resolution Forest Monitoring

The Language of Motion: Unifying Verbal and Non-verbal Language of 3D Human Motion

SINR: Sparsity Driven Compressed Implicit Neural Representations

Chat2SVG: Vector Graphics Generation with Large Language Models and Image Diffusion Models

Spiking Transformer: Introducing Accurate Addition-Only Spiking Self-Attention for Transformer

MANTA: A Large-Scale Multi-View and Visual-Text Anomaly Detection Dataset for Tiny Objects

S2D-LFE: Sparse-to-Dense Light Field Event Generation

Beyond Generation: A Diffusion-based Low-level Feature Extractor for Detecting AI-generated Images

Towards Stable and Storage-efficient Dataset Distillation: Matching Convexified Trajectory

Grounding 3D Object Affordance with Language Instructions, Visual Observations and Interactions

FiRe: Fixed-points of Restoration Priors for Solving Inverse Problems

MOVIS: Enhancing Multi-Object Novel View Synthesis for Indoor Scenes

Symbolic Representation for Any-to-Any Generative Tasks

Advancing Myopia To Holism: Fully Contrastive Language-Image Pre-training

Odd-One-Out: Anomaly Detection by Comparing with Neighbors

Dynamic Stereotype Theory Induced Micro-expression Recognition with Oriented Deformation

Charm: The Missing Piece in ViT Fine-Tuning for Image Aesthetic Assessment

Protecting Your Video Content: Disrupting Automated Video-based LLM Annotations

SCSegamba: Lightweight Structure-Aware Vision Mamba for Crack Segmentation in Structures

MMRL: Multi-Modal Representation Learning for Vision-Language Models

Parallelized Autoregressive Visual Generation

PanDA: Towards Panoramic Depth Anything with Unlabeled Panoramas and Mobius Spatial Augmentation

JiSAM: Alleviate Labeling Burden and Corner Case Problems in Autonomous Driving via Minimal Real-World Data

One Model for ALL: Low-Level Task Interaction Is a Key to Task-Agnostic Image Fusion

OmniDocBench: Benchmarking Diverse PDF Document Parsing with Comprehensive Annotations

Image is All You Need to Empower Large-scale Diffusion Models for In-Domain Generation

MedUnifier: Unifying Vision-and-Language Pre-training on Medical Data with Vision Generation Task using Discrete Visual Representations

RandAR: Decoder-only Autoregressive Visual Generation in Random Orders

Evolving High-Quality Rendering and Reconstruction in a Unified Framework with Contribution-Adaptive Regularization

Language-Guided Salient Object Ranking

Mamba4D: Efficient 4D Point Cloud Video Understanding with Disentangled Spatial-Temporal State Space Models

D^2iT: Dynamic Diffusion Transformer for Accurate Image Generation

Gradient-Guided Annealing for Domain Generalization

Inference-Scale Complexity in ANN-SNN Conversion for High-Performance and Low-Power Applications

NoiseCtrl: A Sampling-Algorithm-Agnostic Conditional Generation Method for Diffusion Models

Segment This Thing: Foveated Tokenization for Efficient Point-Prompted Segmentation

TCFG: Tangential Damping Classifier-free Guidance

AutoPresent: Designing Structured Visuals from Scratch

PVC: Progressive Visual Token Compression for Unified Image and Video Processing in Large Vision-Language Models

AnyDressing: Customizable Multi-Garment Virtual Dressing via Latent Diffusion Models

UNIC-Adapter: Unified Image-instruction Adapter with Multi-modal Transformer for Image Generation

KMD: Koopman Multi-modality Decomposition for Generalized Brain Tumor Segmentation under Incomplete Modalities

LongDiff: Training-Free Long Video Generation in One Go

Structure from Collision

BizGen: Advancing Article-level Visual Text Rendering for Infographics Generation

LSNet: See Large, Focus Small

CADCrafter: Generating Computer-Aided Design Models from Unconstrained Images

DORNet: A Degradation Oriented and Regularized Network for Blind Depth Super-Resolution

Prosody-Enhanced Acoustic Pre-training and Acoustic-Disentangled Prosody Adapting for Movie Dubbing

Unsupervised Continual Domain Shift Learning with Multi-Prototype Modeling

Using Diffusion Priors for Video Amodal Segmentation

Augmented Deep Contexts for Spatially Embedded Video Coding

Towards Source-Free Machine Unlearning

Fractal Calibration for Long-tailed Object Detection

Adaptive Unimodal Regulation for Balanced Multimodal Information Acquisition

Noise Calibration and Spatial-Frequency Interactive Network for STEM Image Enhancement

Learning to Highlight Audio by Watching Movies

HOT3D: Hand and Object Tracking in 3D from Egocentric Multi-View Videos

Video-3D LLM: Learning Position-Aware Video Representation for 3D Scene Understanding

Enhancing Video-LLM Reasoning via Agent-of-Thoughts Distillation

Decouple Distortion from Perception: Region Adaptive Diffusion for Extreme-low Bitrate Perception Image Compression

FedSPA: Generalizable Federated Graph Learning under Homophily Heterogeneity

PRaDA: Projective Radial Distortion Averaging

Kiss3DGen: Repurposing Image Diffusion Models for 3D Asset Generation

Feature Information Driven Position Gaussian Distribution Estimation for Tiny Object Detection

Homogeneous Dynamics Space for Heterogeneous Humans

GazeGene: Large-scale Synthetic Gaze Dataset with 3D Eyeball Annotations

ImViD: Immersive Volumetric Videos for Enhanced VR Engagement

Real-IAD D³: A Real-World 2D/Pseudo-3D/3D Dataset for Industrial Anomaly Detection

SURGEON: Memory-Adaptive Fully Test-Time Adaptation via Dynamic Activation Sparsity

Erase Diffusion: Empowering Object Removal Through Calibrating Diffusion Pathways

MultiVENT 2.0: A Massive Multilingual Benchmark for Event-Centric Video Retrieval

Large Self-Supervised Models Bridge the Gap in Domain Adaptive Object Detection

BiomedCoOp: Learning to Prompt for Biomedical Vision-Language Models

Shape and Texture: What Influences Reliable Optical Flow Estimation?

VideoHandles: Editing 3D Object Compositions in Videos Using Video Generative Priors

Attribute-formed Class-specific Concept Space: Endowing Language Bottleneck Model with Better Interpretability and Scalability

MExD: An Expert-Infused Diffusion Model for Whole-Slide Image Classification

Federated Learning with Domain Shift Eraser

A Unified Framework for Heterogeneous Semi-supervised Learning

Two is Better than One: Efficient Ensemble Defense for Robust and Compact Models

Resilient Sensor Fusion Under Adverse Sensor Failures via Multi-Modal Expert Fusion

Satellite Observations Guided Diffusion Model for Accurate Meteorological States at Arbitrary Resolution

Event-Equalized Dense Video Captioning

Multirate Neural Image Compression with Adaptive Lattice Vector Quantization

VidTwin: Video VAE with Decoupled Structure and Dynamics

Reconstructing People, Places, and Cameras

Evaluating Model Perception of Color Illusions in Photorealistic Scenes

Preserve or Modify? Context-Aware Evaluation for Balancing Preservation and Modification in Text-Guided Image Editing

ProReflow: Progressive Reflow with Decomposed Velocity

DVHGNN: Multi-Scale Dilated Vision HGNN for Efficient Vision Recognition

Object-aware Sound Source Localization via Audio-Visual Scene Understanding

PCM: Picard Consistency Model for Fast Parallel Sampling of Diffusion Models

Towards Precise Scaling Laws for Video Diffusion Transformers

VideoGEM: Training-free Action Grounding in Videos

LLMDet: Learning Strong Open-Vocabulary Object Detectors under the Supervision of Large Language Models

InPO: Inversion Preference Optimization with Reparametrized DDIM for Efficient Diffusion Model Alignment

SceneFactor: Factored Latent 3D Diffusion for Controllable 3D Scene Generation

StreamingT2V: Consistent, Dynamic, and Extendable Long Video Generation from Text

DefMamba: Deformable Visual State Space Model

Color Alignment in Diffusion

Hand-held Object Reconstruction from RGB Video with Dynamic Interaction

DynScene: Scalable Generation of Dynamic Robotic Manipulation Scenes for Embodied AI

Is Your World Simulator a Good Story Presenter? A Consecutive Events-Based Benchmark for Future Long Video Generation

Towards High-fidelity 3D Talking Avatar with Personalized Dynamic Texture

GaussTR: Foundation Model-Aligned Gaussian Transformer for Self-Supervised 3D Spatial Understanding

RobSense: A Robust Multi-modal Foundation Model for Remote Sensing with Static, Temporal, and Incomplete Data Adaptability

Joint Scheduling of Causal Prompts and Tasks for Multi-Task Learning

Hybrid-Level Instruction Injection for Video Token Compression in Multi-modal Large Language Models

Unified Dense Prediction of Video Diffusion

ChatGarment: Garment Estimation, Generation and Editing via Large Language Models

SLADE: Shielding against Dual Exploits in Large Vision-Language Models

Similarity-Guided Layer-Adaptive Vision Transformer for UAV Tracking

PhyS-EdiT: Physics-aware Semantic Image Editing with Text Description

Single Domain Generalization for Few-Shot Counting via Universal Representation Matching

FedAWA: Adaptive Optimization of Aggregation Weights in Federated Learning Using Client Vectors

Mono-InternVL: Pushing the Boundaries of Monolithic Multimodal Large Language Models with Endogenous Visual Pre-training

Embodied Scene Understanding for Vision Language Models via MetaVQA

PEER Pressure: Model-to-Model Regularization for Single Source Domain Generalization

Steering Away from Harm: An Adaptive Approach to Defending Vision Language Model Against Jailbreaks

3D-AVS: LiDAR-based 3D Auto-Vocabulary Segmentation

MODA: Motion-Drift Augmentation for Inertial Human Motion Analysis

Conical Visual Concentration for Efficient Large Vision-Language Models

Functionality Understanding and Segmentation in 3D Scenes

Less is More: Efficient Image Vectorization with Adaptive Parameterization

Sonic: Shifting Focus to Global Audio Perception in Portrait Animation

MESC-3D: Mining Effective Semantic Cues for 3D Reconstruction from a Single Image

Probing the Mid-level Vision Capabilities of Self-Supervised Learning

Style Evolving along Chain-of-Thought for Unknown-Domain Object Detection

Learning to Filter Outlier Edges in Global SfM

Revisiting Audio-Visual Segmentation with Vision-Centric Transformer

UPME: An Unsupervised Peer Review Framework for Multimodal Large Language Model Evaluation

Exploring Scene Affinity for Semi-Supervised LiDAR Semantic Segmentation

DiTCtrl: Exploring Attention Control in Multi-Modal Diffusion Transformer for Tuning-Free Multi-Prompt Longer Video Generation

Multi-Resolution Pathology-Language Pre-training Model with Text-Guided Visual Representation

SpecTRe-GS: Modeling Highly Specular Surfaces with Reflected Nearby Objects by Tracing Rays in 3D Gaussian Splatting

T2SG: Traffic Topology Scene Graph for Topology Reasoning in Autonomous Driving

ReCon: Enhancing True Correspondence Discrimination through Relation Consistency for Robust Noisy Correspondence Learning

Interactive Medical Image Analysis with Concept-based Similarity Reasoning

ConceptGuard: Continual Personalized Text-to-Image Generation with Forgetting and Confusion Mitigation

AKiRa: Augmentation Kit on Rays for Optical Video Generation

RASP: Revisiting 3D Anamorphic Art for Shadow-Guided Packing of Irregular Objects

Exploring Historical Information for RGBE Visual Tracking with Mamba

MLVU: Benchmarking Multi-task Long Video Understanding

Rate-In: Information-Driven Adaptive Dropout Rates for Improved Inference-Time Uncertainty Estimation

Adaptive Markup Language Generation for Contextually-Grounded Visual Document Understanding

Identifying and Mitigating Spurious Correlation in Multi-Task Learning

Mitigating Object Hallucinations in Large Vision-Language Models with Assembly of Global and Local Attention

Sketchy Bounding-box Supervision for 3D Instance Segmentation

Continuous, Subject-Specific Attribute Control in T2I Models by Identifying Semantic Directions

End-to-End Implicit Neural Representations for Classification

FSboard: Over 3 Million Characters of ASL Fingerspelling Collected via Smartphones

Adversarial Domain Prompt Tuning and Generation for Single Domain Generalization

Full-DoF Egomotion Estimation for Event Cameras Using Geometric Solvers

TopoCellGen: Generating Histopathology Cell Topology with a Diffusion Model

PERSE: Personalized 3D Generative Avatars from A Single Portrait

Towards Explainable and Unprecedented Accuracy in Matching Challenging Finger Crease Patterns

Diffusion Bridge: Leveraging Diffusion Model to Reduce the Modality Gap Between Text and Vision for Zero-Shot Image Captioning

MODfinity: Unsupervised Domain Adaptation with Multimodal Information Flow Intertwining

Towards Universal Soccer Video Understanding

SimLingo: Vision-Only Closed-Loop Autonomous Driving with Language-Action Alignment

Understanding Multi-Task Activities from Single-Task Videos

Immune: Improving Safety Against Jailbreaks in Multi-modal LLMs via Inference-Time Alignment

Efficient Video Super-Resolution for Real-time Rendering with Decoupled G-buffer Guidance

Go-with-the-Flow: Motion-Controllable Video Diffusion Models Using Real-Time Warped Noise

ODHSR: Online Dense 3D Reconstruction of Humans and Scenes from Monocular Videos

Animate and Sound an Image

CustomKD: Customizing Large Vision Foundation for Edge Model Improvement via Knowledge Distillation

Volume Tells: Dual Cycle-Consistent Diffusion for 3D Fluorescence Microscopy De-noising and Super-Resolution

OVO-Bench: How Far is Your Video-LLMs from Real-World Online Video Understanding?

SleeperMark: Towards Robust Watermark against Fine-Tuning Text-to-image Diffusion Models

Prior-free 3D Object Tracking

The Devil is in the Prompts: Retrieval-Augmented Prompt Optimization for Text-to-Video Generation

Enhancing Few-Shot Class-Incremental Learning via Training-Free Bi-Level Modality Calibration

Chapter-Llama: Efficient Chaptering in Hour-Long Videos with LLMs

SOGS: Second-Order Anchor for Advanced 3D Gaussian Splatting

VoxelSplat: Dynamic Gaussian Splatting as an Effective Loss for Occupancy and Flow Prediction

Self-supervised ControlNet with Spatio-Temporal Mamba for Real-world Video Super-resolution

RoboTwin: Dual-Arm Robot Benchmark with Generative Digital Twins

Instruction-based Image Manipulation by Watching How Things Move

LION-FS: Fast & Slow Video-Language Thinker as Online Video Assistant

Preconditioners for the Stochastic Training of Neural Fields

Provoking Multi-modal Few-Shot LVLM via Exploration-Exploitation In-Context Learning

VinaBench: Benchmark for Faithful and Consistent Visual Narratives

MP-GUI: Modality Perception with MLLMs for GUI Understanding

Generative Video Propagation

PUP 3D-GS: Principled Uncertainty Pruning for 3D Gaussian Splatting

CLIP is Almost All You Need: Towards Parameter-Efficient Scene Text Retrieval without OCR

DreamTrack: Dreaming the Future for Multimodal Visual Object Tracking

Enhancing Privacy-Utility Trade-offs to Mitigate Memorization in Diffusion Models

Aesthetic Post-Training Diffusion Models from Generic Preferences with Step-by-step Preference Optimization

Learning Textual Prompts for Open-World Semi-Supervised Learning

VITED: Video Temporal Evidence Distillation

Discovering Hidden Visual Concepts Beyond Linguistic Input in Infant Learning

Docopilot: Improving Multimodal Models for Document-Level Understanding

GauSTAR: Gaussian Surface Tracking and Reconstruction

Latent Space Imaging

3DTopia-XL: Scaling High-quality 3D Asset Generation via Primitive Diffusion

Unraveling Normal Anatomy via Fluid-Driven Anomaly Randomization

Marten: Visual Question Answering with Mask Generation for Multi-modal Document Understanding

FaithDiff: Unleashing Diffusion Priors for Faithful Image Super-resolution

Learned Image Compression with Dictionary-based Entropy Model

ONDA-Pose: Occlusion-Aware Neural Domain Adaptation for Self-Supervised 6D Object Pose Estimation

BEVDiffuser: Plug-and-Play Diffusion Model for BEV Denoising with Ground-Truth Guidance

MVPaint: Synchronized Multi-View Diffusion for Painting Anything 3D

VoCo-LLaMA: Towards Vision Compression with Large Language Models

SnapGen: Taming High-Resolution Text-to-Image Models for Mobile Devices with Efficient Architectures and Training

MegaSaM: Accurate, Fast and Robust Structure and Motion from Casual Dynamic Videos

Think Small, Act Big: Primitive Prompt Learning for Lifelong Robot Manipulation

Align-KD: Distilling Cross-Modal Alignment Knowledge for Mobile Vision-Language Large Model Enhancement

LiveCC: Learning Video LLM with Streaming Speech Transcription at Scale

ETAP: Event-based Tracking of Any Point

Feature Spectrum Learning for Remote Sensing Change Detection

Free Lunch Enhancements for Multi-modal Crowd Counting

LiSu: A Dataset and Method for LiDAR Surface Normal Estimation

Learning Phase Distortion with Selective State Space Models for Video Turbulence Mitigation

Adapting to Observation Length of Trajectory Prediction via Contrastive Learning

Period-LLM: Extending the Periodic Capability of Multimodal Large Language Model

PSBD: Prediction Shift Uncertainty Unlocks Backdoor Detection

Improve Representation for Imbalanced Regression through Geometric Constraints

Your Large Vision-Language Model Only Needs A Few Attention Heads For Visual Grounding

GaussHDR: High Dynamic Range Gaussian Splatting via Learning Unified 3D and 2D Local Tone Mapping

Differentiable Inverse Rendering with Interpretable Basis BRDFs

Beyond Background Shift: Rethinking Instance Replay in Continual Semantic Segmentation

ShapeShifter: 3D Variations Using Multiscale and Sparse Point-Voxel Diffusion

Task Preference Optimization: Improving Multimodal Large Language Models with Vision Task Alignment

Integral Fast Fourier Color Constancy

MambaOut: Do We Really Need Mamba for Vision?

TreeMeshGPT: Artistic Mesh Generation with Autoregressive Tree Sequencing

Training Data Provenance Verification: Did Your Model Use Synthetic Data from My Generative Model for Training?

vesselFM: A Foundation Model for Universal 3D Blood Vessel Segmentation

DistinctAD: Distinctive Audio Description Generation in Contexts

HOT: Hadamard-based Optimized Training

AnySat: One Earth Observation Model for Many Resolutions, Scales, and Modalities

MV-SSM: Multi-View State Space Modeling for 3D Human Pose Estimation

JTD-UAV: MLLM-Enhanced Joint Tracking and Description Framework for Anti-UAV Systems

NVComposer: Boosting Generative Novel View Synthesis with Multiple Sparse and Unposed Images

Stop Learning it all to Mitigate Visual Hallucination, Focus on the Hallucination Target

Model Poisoning Attacks to Federated Learning via Multi-Round Consistency

STEP: Enhancing Video-LLMs’ Compositional Reasoning by Spatio-Temporal Graph-guided Self-Training

Bridging the Vision-Brain Gap with an Uncertainty-Aware Blur Prior

VLMs-Guided Representation Distillation for Efficient Vision-Based Reinforcement Learning

GOAL: Global-local Object Alignment Learning

Scalable Video-to-Dataset Generation for Cross-Platform Mobile Agents

Multi-Modal Aerial-Ground Cross-View Place Recognition with Neural ODEs

Collaborative Tree Search for Enhancing Embodied Multi-Agent Collaboration

Continuous Locomotive Crowd Behavior Generation

LeanGaussian: Breaking Pixel or Point Cloud Correspondence in Modeling 3D Gaussians

Data Distributional Properties As Inductive Bias for Systematic Generalization

DKC: Differentiated Knowledge Consolidation for Cloth-Hybrid Lifelong Person Re-identification

See Further When Clear: Curriculum Consistency Model

CaMuViD: Calibration-Free Multi-View Detection

SpatialDreamer: Self-supervised Stereo Video Synthesis from Monocular Input

An Image-like Diffusion Method for Human-Object Interaction Detection

From Sparse Signal to Smooth Motion: Real-Time Motion Generation with Rolling Prediction Models

Improving Adversarial Transferability on Vision Transformers via Forward Propagation Refinement

AC3D: Analyzing and Improving 3D Camera Control in Video Diffusion Transformers

Adv-CPG: A Customized Portrait Generation Framework with Facial Adversarial Attacks

Shadow Generation Using Diffusion Model with Geometry Prior

Extrapolating and Decoupling Image-to-Video Generation Models: Motion Modeling is Easier Than You Think

GliaNet: Adaptive Neural Network Structure Learning with Glia-Driven

Visual Agentic AI for Spatial Reasoning with a Dynamic API

Modeling Multiple Normal Action Representations for Error Detection in Procedural Tasks

Hypergraph Vision Transformers: Images are More than Nodes, More than Edges

Rethinking Temporal Fusion with a Unified Gradient Descent View for 3D Semantic Occupancy Prediction

SemAlign3D: Semantic Correspondence between RGB-Images through Aligning 3D Object-Class Representations

RUBIK: A Structured Benchmark for Image Matching across Geometric Challenges

Star with Bilinear Mapping

Efficient Personalization of Quantized Diffusion Model without Backpropagation

M3GYM: A Large-Scale Multimodal Multi-view Multi-person Pose Dataset for Fitness Activity Understanding in Real-world Settings

Iterative Predictor-Critic Code Decoding for Real-World Image Dehazing

Vid2Sim: Realistic and Interactive Simulation from Video for Urban Navigation

Community Forensics: Using Thousands of Generators to Train Fake Image Detectors

A Unified Latent Schrödinger Bridge Diffusion Model for Unsupervised Anomaly Detection and Localization

Unveiling the Ignorance of MLLMs: Seeing Clearly, Answering Incorrectly

Beyond Image Classification: A Video Benchmark and Dual-Branch Hybrid Discrimination Framework for Compositional Zero-Shot Learning

Towards Efficient Foundation Model for Zero-shot Amodal Segmentation

Uncertain Multimodal Intention and Emotion Understanding in the Wild

OSMamba: Omnidirectional Spectral Mamba with Dual-Domain Prior Generator for Exposure Correction

Synthetic Data is an Elegant GIFT for Continual Vision-Language Models

KVQ: Boosting Video Quality Assessment via Saliency-guided Local Perception

LongVALE: Vision-Audio-Language-Event Benchmark Towards Time-Aware Omni-Modal Perception of Long Videos

ProbeSDF: Light Field Probes For Neural Surface Reconstruction

WeakMCN: Multi-task Collaborative Network for Weakly Supervised Referring Expression Comprehension and Segmentation

Detecting Backdoor Attacks in Federated Learning via Direction Alignment Inspection

MambaVision: A Hybrid Mamba-Transformer Vision Backbone

Narrating the Video: Boosting Text-Video Retrieval via Comprehensive Utilization of Frame-Level Captions

Learning Flow Fields in Attention for Controllable Person Image Generation

Reconstructing Close Human Interaction with Appearance and Proxemics Reasoning

Distilling Spatially-Heterogeneous Distortion Perception for Blind Image Quality Assessment

Generalizing Deepfake Video Detection with Plug-and-Play: Video-Level Blending and Spatiotemporal Adapter Tuning

D2SP: Dynamic Dual-Stage Purification Framework for Dual Noise Mitigation in Vision-based Affective Recognition

AniGrad: Anisotropic Gradient-Adaptive Sampling for 3D Reconstruction From Monocular Video

Progressive Focused Transformer for Single Image Super-Resolution

Semantic Library Adaptation: LoRA Retrieval and Fusion for Open-Vocabulary Semantic Segmentation

Early-Bird Diffusion: Investigating and Leveraging Timestep-Aware Early-Bird Tickets in Diffusion Models for Efficient Training

Rethinking Noisy Video-Text Retrieval via Relation-aware Alignment

Token Cropr: Faster ViTs for Quite a Few Tasks

ArticulatedGS: Self-supervised Digital Twin Modeling of Articulated Objects using 3D Gaussian Splatting

Detecting Out-of-Distribution Through the Lens of Neural Collapse

Doppelgangers++: Improved Visual Disambiguation with Geometric 3D Features

From Faces to Voices: Learning Hierarchical Representations for High-quality Video-to-Speech

Learnable Infinite Taylor Gaussian for Dynamic View Rendering

FreeCloth: Free-form Generation Enhances Challenging Clothed Human Modeling

SnowMaster: Comprehensive Real-world Image Desnowing via MLLM with Multi-Model Feedback Optimization

DViN: Dynamic Visual Routing Network for Weakly Supervised Referring Expression Comprehension

Argus: A Compact and Versatile Foundation Model for Vision

Multi-View Pose-Agnostic Change Localization with Zero Labels

TSP-Mamba: The Travelling Salesman Problem Meets Mamba for Image Super-resolution and Beyond

Motions as Queries: One-Stage Multi-Person Holistic Human Motion Capture

Any6D: Model-free 6D Pose Estimation of Novel Object

DELT: A Simple Diversity-driven EarlyLate Training for Dataset Distillation

Stereo Anywhere: Robust Zero-Shot Deep Stereo Matching Even Where Either Stereo or Mono Fail

SpiritSight Agent: Advanced GUI Agent with One Look

Self-Cross Diffusion Guidance for Text-to-Image Synthesis of Similar Subjects

3D Gaussian Head Avatars with Expressive Dynamic Appearances by Compact Tensorial Representations

The Illusion of Unlearning: The Unstable Nature of Machine Unlearning in Text-to-Image Diffusion Models

SaMam: Style-aware State Space Model for Arbitrary Image Style Transfer

Auto-Encoded Supervision for Perceptual Image Super-Resolution

SIDA: Social Media Image Deepfake Detection, Localization and Explanation with Large Multimodal Model

BOOTPLACE: Bootstrapped Object Placement with Detection Transformers

Q-PART: Quasi-Periodic Adaptive Regression with Test-time Training for Pediatric Left Ventricular Ejection Fraction Regression

One2Any: One-Reference 6D Pose Estimation for Any Object

A Comprehensive Study of Decoder-Only LLMs for Text-to-Image Generation

PreciseCam: Precise Camera Control for Text-to-Image Generation

EventGPT: Event Stream Understanding with Multimodal Large Language Models

Synthetic-to-Real Self-supervised Robust Depth Estimation via Learning with Motion and Structure Priors

From Zero to Detail: Deconstructing Ultra-High-Definition Image Restoration from Progressive Spectral Perspective

Rectification-specific Supervision and Constrained Estimator for Online Stereo Rectification

Event-based Video Super-Resolution via State Space Models

Making Old Film Great Again: Degradation-aware State Space Model for Old Film Restoration

Leveraging Global Stereo Consistency for Category-Level Shape and 6D Pose Estimation from Stereo Images

EfficientLLaVA: Generalizable Auto-Pruning for Large Vision-language Models

Detection-Friendly Nonuniformity Correction: A Union Framework for Infrared UAV Target Detection

Articulated Kinematics Distillation from Video Diffusion Models

MP-SfM: Monocular Surface Priors for Robust Structure-from-Motion

OnlineAnySeg: Online Zero-Shot 3D Segmentation by Visual Foundation Model Guided 2D Mask Merging

Tora: Trajectory-oriented Diffusion Transformer for Video Generation

Volumetrically Consistent 3D Gaussian Rasterization

Morpheus: Text-Driven 3D Gaussian Splat Shape and Color Stylization

CacheQuant: Comprehensively Accelerated Diffusion Models

The Impact Label Noise and Choice of Threshold has on Cross-Entropy and Soft-Dice in Image Segmentation

Open-World Objectness Modeling Unifies Novel Object Detection

LLaVA-Critic: Learning to Evaluate Multimodal Models

Repurposing Pre-trained Video Diffusion Models for Event-based Video Interpolation

MotionPRO: Exploring the Role of Pressure in Human MoCap and Beyond

DiffVsgg: Diffusion-Driven Online Video Scene Graph Generation

Large-scale Multi-view Tensor Clustering with Implicit Linear Kernels

Generalized Diffusion Detector: Mining Robust Features from Diffusion Models for Domain-Generalized Detection

Q-Eval-100K: Evaluating Visual Quality and Alignment Level for Text-to-Vision Content

Dual Focus-Attention Transformer for Robust Point Cloud Registration

Forming Auxiliary High-confident Instance-level Loss to Promote Learning from Label Proportions

Progress-Aware Video Frame Captioning

SMTPD: A New Benchmark for Temporal Prediction of Social Media Popularity

Enhancing Dance-to-Music Generation via Negative Conditioning Latent Diffusion Model

Neuro-Symbolic Evaluation of Text-to-Video Models using Formal Verification

Spherical Manifold Guided Diffusion Model for Panoramic Image Generation

Learning on Model Weights using Tree Experts

Rethinking Query-based Transformer for Continual Image Segmentation

Image Reconstruction from Readout-Multiplexed Single-Photon Detector Arrays

Towards Smart Point-and-Shoot Photography

SlideChat: A Large Vision-Language Assistant for Whole-Slide Pathology Image Understanding

Towards Transformer-Based Aligned Generation with Self-Coherence Guidance

Accurate Scene Text Recognition with Efficient Model Scaling and Cloze Self-Distillation

DART: Disease-aware Image-Text Alignment and Self-correcting Re-alignment for Trustworthy Radiology Report Generation

On the Consistency of Video Large Language Models in Temporal Comprehension

Mitigating the Human-Robot Domain Discrepancy in Visual Pre-training for Robotic Manipulation

One-Minute Video Generation with Test-Time Training

Text-guided Sparse Voxel Pruning for Efficient 3D Visual Grounding

Common3D: Self-Supervised Learning of 3D Morphable Models for Common Objects in Neural Feature Space

LinGen: Towards High-Resolution Minute-Length Text-to-Video Generation with Linear Computational Complexity

EditSplat: Multi-View Fusion and Attention-Guided Optimization for View-Consistent 3D Scene Editing with 3D Gaussian Splatting

SpatialCLIP: Learning 3D-aware Image Representations from Spatially Discriminative Language

Mono2Stereo: A Benchmark and Empirical Study for Stereo Conversion

Towards Open-Vocabulary Audio-Visual Event Localization

One-shot 3D Object Canonicalization based on Geometric and Semantic Consistency

Inst3D-LMM: Instance-Aware 3D Scene Understanding with Multi-modal Instruction Tuning

HIIF: Hierarchical Encoding based Implicit Image Function for Continuous Super-resolution

Motion Prompting: Controlling Video Generation with Motion Trajectories

VERA: Explainable Video Anomaly Detection via Verbalized Learning of Vision-Language Models

ProHOC: Probabilistic Hierarchical Out-of-Distribution Classification via Multi-Depth Networks

CCIN: Compositional Conflict Identification and Neutralization for Composed Image Retrieval

CLIP is Strong Enough to Fight Back: Test-time Counterattacks towards Zero-shot Adversarial Robustness of CLIP

OverLoCK: An Overview-first-Look-Closely-next ConvNet with Context-Mixing Dynamic Kernels

VTON-HandFit: Virtual Try-on for Arbitrary Hand Pose Guided by Hand Priors Embedding

Interleaved-Modal Chain-of-Thought

Uni-Renderer: Unifying Rendering and Inverse Rendering Via Dual Stream Diffusion

Enhancing Adversarial Transferability with Checkpoints of a Single Model’s Training

POSTA: A Go-to Framework for Customized Artistic Poster Generation

VLsI: Verbalized Layers-to-Interactions from Large to Small Vision Language Models

O-TPT: Orthogonality Constraints for Calibrating Test-time Prompt Tuning in Vision-Language Models

Flash3D: Super-scaling Point Transformers through Joint Hardware-Geometry Locality

Analyzing the Synthetic-to-Real Domain Gap in 3D Hand Pose Estimation

Feature4X: Bridging Any Monocular Video to 4D Agentic AI with Versatile Gaussian Feature Fields

Comprehensive Relighting: Generalizable and Consistent Monocular Human Relighting and Harmonization

Hyperspectral Pansharpening via Diffusion Models with Iteratively Zero-Shot Guidance

Efficient Motion-Aware Video MLLM

DSPNet: Dual-vision Scene Perception for Robust 3D Question Answering

Zero-Shot 4D Lidar Panoptic Segmentation

MAtCha Gaussians: Atlas of Charts for High-Quality Geometry and Photorealism From Sparse Views

Extreme Rotation Estimation in the Wild

ADU: Adaptive Detection of Unknown Categories in Black-Box Domain Adaptation

EmotiveTalk: Expressive Talking Head Generation through Audio Information Decoupling and Emotional Video Diffusion

Traversing Distortion-Perception Tradeoff using a Single Score-Based Generative Model

IceDiff: High Resolution and High-Quality Arctic Sea Ice Forecasting with Generative Diffusion Prior

DTOS: Dynamic Time Object Sensing with Large Multimodal Model

How to Merge Your Multimodal Models Over Time?

Identifying and Mitigating Position Bias of Multi-image Vision-Language Models

Unsupervised Foundation Model-Agnostic Slide-Level Representation Learning

Exploring CLIP's Dense Knowledge for Weakly Supervised Semantic Segmentation

UNIALIGN: Scaling Multimodal Alignment within One Unified Model

ShowHowTo: Generating Scene-Conditioned Step-by-Step Visual Instructions

Exploration-Driven Generative Interactive Environments

Task-Agnostic Guided Feature Expansion for Class-Incremental Learning

ShowUI: One Vision-Language-Action Model for GUI Visual Agent

Let's Chorus: Partner-aware Hybrid Song-Driven 3D Head Animation

Twinner: Shining Light on Digital Twins in a Few Snaps

Infinity∞: Scaling Bitwise AutoRegressive Modeling for High-Resolution Image Synthesis

DreamText: High Fidelity Scene Text Synthesis

MonoPlace3D: Learning 3D-Aware Object Placement for 3D Monocular Detection

HumanDreamer: Generating Controllable Human-Motion Videos via Decoupled Generation

ReVisionLLM: Recursive Vision-Language Model for Temporal Grounding in Hour-Long Videos

ArtiFade: Learning to Generate High-quality Subject from Blemished Images

Prompting Depth Anything for 4K Resolution Accurate Metric Depth Estimation

Advancing Semantic Future Prediction through Multimodal Visual Sequence Transformers

GET: Unlocking the Multi-modal Potential of CLIP for Generalized Category Discovery

On the Out-Of-Distribution Generalization of Large Multimodal Models

Test-Time Domain Generalization via Universe Learning: A Multi-Graph Matching Approach for Medical Image Segmentation

Easy-editable Image Vectorization with Multi-layer Multi-scale Distributed Visual Feature Embedding

Acquire and then Adapt: Squeezing out Text-to-Image Model for Image Restoration

DeDe: Detecting Backdoor Samples for SSL Encoders via Decoders

Towards Scalable Human-aligned Benchmark for Text-guided Image Editing

Devils in Middle Layers of Large Vision-Language Models: Interpreting, Detecting and Mitigating Object Hallucinations via Attention Lens

SpectroMotion: Dynamic 3D Reconstruction of Specular Scenes

Scaling Inference Time Compute for Diffusion Models

Coeff-Tuning: A Graph Filter Subspace View for Tuning Attention-Based Large Models

MVBoost: Boost 3D Reconstruction with Multi-View Refinement

Chat-based Person Retrieval via Dialogue-Refined Cross-Modal Alignment

Category-Agnostic Neural Object Rigging

AVF-MAE++: Scaling Affective Video Facial Masked Autoencoders via Efficient Audio-Visual Self-Supervised Learning

POPEN: Preference-Based Optimization and Ensemble for LVLM-Based Reasoning Segmentation

GroundingFace: Fine-grained Face Understanding via Pixel Grounding Multimodal Large Language Model

Self-Supervised Cross-View Correspondence with Predictive Cycle Consistency

MMAudio: Taming Multimodal Joint Training for High-Quality Video-to-Audio Synthesis

CryptoFace: End-to-End Encrypted Face Recognition

Relation-Rich Visual Document Generator for Visual Information Extraction

DynFocus: Dynamic Cooperative Network Empowers LLMs with Video Understanding

Mimic In-Context Learning for Multimodal Tasks

PromptHash: Affinity-Prompted Collaborative Cross-Modal Learning for Adaptive Hashing Retrieval

Vision-Language Models Do Not Understand Negation

ID-Patch: Robust ID Association for Group Photo Personalization

iG-6DoF: Model-free 6DoF Pose Estimation for Unseen Object via Iterative 3D Gaussian Splatting

HyperNet Fields: Efficiently Training Hypernetworks without Ground Truth by Learning Weight Trajectories

Universal Scene Graph Generation

RICCARDO: Radar Hit Prediction and Convolution for Camera-Radar 3D Object Detection

BLADE: Single-view Body Mesh Estimation through Accurate Depth Estimation

AdaMMS: Model Merging for Heterogeneous Multimodal Large Language Models with Unsupervised Coefficient Optimization

MoEE: Mixture of Emotion Experts for Audio-Driven Portrait Animation

ReCap: Better Gaussian Relighting with Cross-Environment Captures

Split Adaptation for Pre-trained Vision Transformers

SpatialLLM: A Compound 3D-Informed Design towards Spatially-Intelligent Large Multimodal Models

SLVR: Super-Light Visual Reconstruction via Blueprint Controllable Convolutions and Exploring Feature Diversity Representation

Vision-Language Embodiment for Monocular Depth Estimation

Layered Image Vectorization via Semantic Simplification

Learning Occlusion-Robust Vision Transformers for Real-Time UAV Tracking

Plug-and-Play Versatile Compressed Video Enhancement

UltraFusion: Ultra High Dynamic Imaging using Exposure Fusion

Automated Proof of Polynomial Inequalities via Reinforcement Learning

Frequency Dynamic Convolution for Dense Image Prediction

IDEA: Inverted Text with Cooperative Deformable Aggregation for Multi-modal Object Re-Identification

SAM-I2V: Upgrading SAM to Support Promptable Video Segmentation with Less than 0.2% Training Cost

GroupMamba: Efficient Group-Based Visual State Space Model

Escaping Plato's Cave: Towards the Alignment of 3D and Text Latent Spaces

How Do I Do That? Synthesizing 3D Hand Motion and Contacts for Everyday Interactions

Consistency Posterior Sampling for Diverse Image Synthesis

IMFine: 3D Inpainting via Geometry-guided Multi-view Refinement

ActiveGAMER: Active GAussian Mapping through Efficient Rendering

DeepCompress-ViT: Rethinking Model Compression to Enhance Efficiency of Vision Transformers at the Edge

EvOcc: Accurate Semantic Occupancy for Automated Driving Using Evidence Theory

Positive2Negative: Breaking the Information-Lossy Barrier in Self-Supervised Single Image Denoising

PGC: Physics-Based Gaussian Cloth from a Single Pose

Joint Vision-Language Social Bias Removal for CLIP

MV-DUSt3R+: Single-Stage Scene Reconstruction from Sparse Views In 2 Seconds

Explicit Depth-Aware Blurry Video Frame Interpolation Guided by Differential Curves

OFER: Occluded Face Expression Reconstruction

SeaLion: Semantic Part-Aware Latent Point Diffusion Models for 3D Generation

MonSter: Marry Monodepth to Stereo Unleashes Power

Toward Real-world BEV Perception: Depth Uncertainty Estimation via Gaussian Splatting

Efficient Fine-Tuning and Concept Suppression for Pruned Diffusion Models

WildGS-SLAM: Monocular Gaussian Splatting SLAM in Dynamic Environments

A Tale of Two Classes: Adapting Supervised Contrastive Learning to Binary Imbalanced Datasets

DEAL: Data-Efficient Adversarial Learning for High-Quality Infrared Imaging

RePerformer: Immersive Human-centric Volumetric Videos from Playback to Photoreal Reperformance

CheXWorld: Exploring Image World Modeling for Radiograph Representation Learning

Towards Long-Horizon Vision-Language Navigation: Platform, Benchmark and Method

BimArt: A Unified Approach for the Synthesis of 3D Bimanual Interaction with Articulated Objects

Open-World Amodal Appearance Completion

CoSpace: Benchmarking Continuous Space Perception Ability for Vision-Language Models

RivuletMLP: An MLP-based Architecture for Efficient Compressed Video Quality Enhancement

OmniManip: Towards General Robotic Manipulation via Object-Centric Interaction Primitives as Spatial Constraints

Show and Tell: Visually Explainable Deep Neural Nets via Spatially-Aware Concept Bottleneck Models

Reanimating Images using Neural Representations of Dynamic Stimuli

Visual-Instructed Degradation Diffusion for All-in-One Image Restoration

Insightful Instance Features for 3D Instance Segmentation

Learning 4D Panoptic Scene Graph Generation from Rich 2D Visual Scene

Knowledge Bridger: Towards Training-Free Missing Modality Completion

EmoDubber: Towards High Quality and Emotion Controllable Movie Dubbing

DepthCrafter: Generating Consistent Long Depth Sequences for Open-world Videos

TexGarment: Consistent Garment UV Texture Generation via Efficient 3D Structure-Guided Diffusion Transformer

A Hubness Perspective on Representation Learning for Graph-Based Multi-View Clustering

Spatial-Temporal Graph Diffusion Policy with Kinematic Modeling for Bimanual Robotic Manipulation

Semi-Supervised State-Space Model with Dynamic Stacking Filter for Real-World Video Deraining

Rethinking Vision-Language Model in Face Forensics: Multi-Modal Interpretable Forged Face Detector

VSNet: Focusing on the Linguistic Characteristics of Sign Language

Active Hyperspectral Imaging Using an Event Camera

Bridging the Gap between Gaussian Diffusion Models and Universal Quantization for Image Compression

VideoRefer Suite: Advancing Spatial-Temporal Object Understanding with Video LLM

Multi-modal Medical Diagnosis via Large-small Model Collaboration

SAMBLE: Shape-Specific Point Cloud Sampling for an Optimal Trade-Off Between Local Detail and Global Uniformity

Gen3DEval: Using vLLMs for Automatic Evaluation of Generated 3D Objects

SkySense-O: Towards Open-World Remote Sensing Interpretation with Vision-Centric Visual-Language Modeling

AdaDARE-gamma: Balancing Stability and Plasticity in Multi-modal LLMs through Efficient Adaptation

LeviTor: 3D Trajectory Oriented Image-to-Video Synthesis

GaPT-DAR: Category-level Garments Pose Tracking via Integrated 2D Deformation and 3D Reconstruction

ABC-Former: Auxiliary Bimodal Cross-domain Transformer with Interactive Channel Attention for White Balance

Fingerprinting Denoising Diffusion Probabilistic Models

Re-thinking Temporal Search for Long-Form Video Understanding

InstanceGaussian: Appearance-Semantic Joint Gaussian Representation for 3D Instance-Level Perception

When Domain Generalization meets Generalized Category Discovery: An Adaptive Task-Arithmetic Driven Approach

CSC-PA: Cross-image Semantic Correlation via Prototype Attentions for Single-network Semi-supervised Breast Tumor Segmentation

BIP3D: Bridging 2D Images and 3D Perception for Embodied Intelligence

Query Efficient Black-Box Visual Prompting with Subspace Learning

VisionPAD: A Vision-Centric Pre-training Paradigm for Autonomous Driving

Detecting Adversarial Data Using Perturbation Forgery

CoA: Towards Real Image Dehazing via Compression-and-Adaptation

NightAdapter: Learning a Frequency Adapter for Generalizable Night-time Scene Segmentation

TopV: Compatible Token Pruning with Inference Time Optimization for Fast and Low-Memory Multimodal Vision Language Model

Improving Autoregressive Visual Generation with Cluster-Oriented Token Prediction

Learned Binocular-Encoding Optics for RGBD Imaging Using Joint Stereo and Focus Cues

Dual-view X-ray Detection: Can AI Detect Prohibited Items from Dual-view X-ray Images like Humans?

MobilePortrait: Real-Time One-Shot Neural Head Avatars on Mobile Devices

D^3: Scaling Up Deepfake Detection by Learning from Discrepancy

Jailbreaking the Non-Transferable Barrier via Test-Time Data Disguising

Light3R-SfM: Towards Feed-forward Structure-from-Motion

Robotic Visual Instruction

Solving Instance Detection from an Open-World Perspective

Percept, Memory, and Imagine: World Feature Simulating for Open-Domain Unknown Object Detection

Efficient Depth Estimation for Unstable Stereo Camera Systems on AR Glasses

3D-GRAND: A Million-Scale Dataset for 3D-LLMs with Better Grounding and Less Hallucination

LiDAR-RT: Gaussian-based Ray Tracing for Dynamic LiDAR Re-simulation

Generative Zero-Shot Composed Image Retrieval

Large-Scale Text-to-Image Model with Inpainting is a Zero-Shot Subject-Driven Image Generator

MASt3R-SLAM: Real-Time Dense SLAM with 3D Reconstruction Priors

Cross-modal Information Flow in Multimodal Large Language Models

Consistent and Controllable Image Animation with Motion Diffusion Models

Towards Better Alignment: Training Diffusion Models with Reinforcement Learning Against Sparse Rewards

Spatial457: A Diagnostic Benchmark for 6D Spatial Reasoning of Large Multimodal Models

Omnidirectional Multi-Object Tracking

Potential Field Based Deep Metric Learning

Enhancing Vision-Language Compositional Understanding with Multimodal Synthetic Data

Directional Label Diffusion Model for Learning from Noisy Labels

AA-CLIP: Enhancing Zero-Shot Anomaly Detection via Anomaly-Aware CLIP

HybridGS: Decoupling Transients and Statics with 2D and 3D Gaussian Splatting

Keyframe-Guided Creative Video Inpainting

Channel Consistency Prior and Self-Reconstruction Strategy Based Unsupervised Image Deraining

MobileMamba: Lightweight Multi-Receptive Visual Mamba Network

SimLTD: Simple Supervised and Semi-Supervised Long-Tailed Object Detection

EarthDial: Turning Multi-sensory Earth Observations to Interactive Dialogues

Learning Endogenous Attention for Incremental Object Detection

StarGen: A Spatiotemporal Autoregression Framework with Video Diffusion Model for Scalable and Controllable Scene Generation

HyperSeg: Hybrid Segmentation Assistant with Fine-grained Visual Perceiver

Diffusion-based Event Generation for High-Quality Image Deblurring

Video Summarization with Large Language Models

Sketchtopia: A Dataset and Foundational Agents for Benchmarking Asynchronous Multimodal Communication with Iconic Feedback

Consistency-aware Self-Training for Iterative-based Stereo Matching

Balanced Rate-Distortion Optimization in Learned Image Compression

Bridge the Gap: From Weak to Full Supervision for Temporal Action Localization with PseudoFormer

HomoGen: Enhanced Video Inpainting via Homography Propagation and Diffusion

Generalized Few-shot 3D Point Cloud Segmentation with Vision-Language Model

Online Task-Free Continual Learning via Dynamic Expansionable Memory Distribution

Seeing is Not Believing: Adversarial Natural Object Optimization for Hard-Label 3D Scene Attacks

Once-Tuning-Multiple-Variants: Tuning Once and Expanded as Multiple Vision-Language Model Variants

Rethinking Epistemic and Aleatoric Uncertainty for Active Open-Set Annotation: An Energy-Based Approach

OffsetOPT: Explicit Surface Reconstruction without Normals

Devil is in the Detail: Towards Injecting Fine Details of Image Prompt in Image Generation via Conflict-free Guidance and Stratified Attention

Noise Modeling in One Hour: Minimizing Preparation Efforts for Self-supervised Low-Light RAW Image Denoising

SfM-Free 3D Gaussian Splatting via Hierarchical Training

Heterogeneous Skeleton-Based Action Representation Learning

Point Cloud Upsampling Using Conditional Diffusion Module with Adaptive Noise Suppression

Beyond Clean Training Data: A Versatile and Model-Agnostic Framework for Out-of-Distribution Detection with Contaminated Training Data

Cubify Anything: Scaling Indoor 3D Object Detection

CTRL-D: Controllable Dynamic 3D Scene Editing with Personalized 2D Diffusion

Scale Efficient Training for Large Datasets

Recurrent Feature Mining and Keypoint Mixup Padding for Category-Agnostic Pose Estimation

Seeing Speech and Sound: Distinguishing and Locating Audio Sources in Visual Scenes

Minimizing Labeled, Maximizing Unlabeled: An Image-Driven Approach for Video Instance Segmentation

Do Visual Imaginations Improve Vision-and-Language Navigation Agents?

MoSca: Dynamic Gaussian Fusion from Casual Videos via 4D Motion Scaffolds

MUST: The First Dataset and Unified Framework for Multispectral UAV Single Object Tracking

IncEventGS: Pose-Free Gaussian Splatting from a Single Event Camera

Hybrid Global-Local Representation with Augmented Spatial Guidance for Zero-Shot Referring Image Segmentation

MAP: Unleashing Hybrid Mamba-Transformer Vision Backbone's Potential with Masked Autoregressive Pretraining

Relation3D: Enhancing Relation Modeling for Point Cloud Instance Segmentation

Style Quantization for Data-Efficient GAN Training

FASTer: Focal token Acquiring-and-Scaling Transformer for Long-term 3D Object Detection

SDGOCC: Semantic and Depth-Guided Bird's-Eye View Transformation for 3D Multimodal Occupancy Prediction

Cheb-GR: Rethinking K-nearest Neighbor Search in Re-ranking for Person Re-identification

FIction: 4D Future Interaction Prediction from Video

SplatAD: Real-Time Lidar and Camera Rendering with 3D Gaussian Splatting for Autonomous Driving

Seeking Consistent Flat Minima for Better Domain Generalization via Refining Loss Landscapes

Ev-3DOD: Pushing the Temporal Boundaries of 3D Object Detection with Event Cameras

VideoAutoArena: An Automated Arena for Evaluating Large Multimodal Models in Video Analysis through User Simulation

AnyMap: Learning a General Camera Model for Structure-from-Motion with Unknown Distortion in Dynamic Scenes

Leveraging 3D Geometric Priors in 2D Rotation Symmetry Detection

3D-HGS: 3D Half-Gaussian Splatting

Detect-and-Guide: Self-regulation of Diffusion Models for Safe Text-to-Image Generation via Guideline Token Optimization

Plug-and-Play Interpretable Responsible Text-to-Image Generation via Dual-Space Multi-facet Concept Control

Boost Your Human Image Generation Model via Direct Preference Optimization

Learning Physics From Video: Unsupervised Physical Parameter Estimation for Continuous Dynamical Systems

DeClotH: Decomposable 3D Cloth and Human Body Reconstruction from a Single Image

DepthCues: Evaluating Monocular Depth Perception in Large Vision Models

TailedCore: Few-Shot Sampling for Unsupervised Long-Tail Noisy Anomaly Detection

Generalized Gaussian Entropy Model for Point Cloud Attribute Compression with Dynamic Likelihood Intervals

Learning to Sample Effective and Diverse Prompts for Text-to-Image Generation

Beyond Local Sharpness: Communication-Efficient Global Sharpness-aware Minimization for Federated Learning

F^3OCUS - Federated Finetuning of Vision-Language Foundation Models with Optimal Client Layer Updating Strategy via Multi-objective Meta-Heuristics

ConText-CIR: Learning from Concepts in Text for Composed Image Retrieval

Text Embedding is Not All You Need: Attention Control for Text-to-Image Semantic Alignment with Text Self-Attention Maps

ReRAW: RGB-to-RAW Image Reconstruction via Stratified Sampling for Efficient Object Detection on the Edge

Stop Walking in Circles! Bailing Out Early in Projected Gradient Descent

LoKi: Low-dimensional KAN for Efficient Fine-tuning Image Models

DIV-FF: Dynamic Image-Video Feature Fields For Environment Understanding in Egocentric Videos

Fortifying Federated Learning Towards Trustworthiness via Auditable Data Valuation and Verifiable Client Contribution

Multi-Layer Visual Feature Fusion in Multimodal LLMs: Methods, Analysis, and Best Practices

AdMiT: Adaptive Multi-Source Tuning in Dynamic Environments

Notes-guided MLLM Reasoning: Enhancing MLLM with Knowledge and Visual Notes for Visual Question Answering

PLeaS - Merging Models with Permutations and Least Squares

Context-Enhanced Memory-Refined Transformer for Online Action Detection

Multi-modal Knowledge Distillation-based Human Trajectory Forecasting

MatAnyone: Stable Video Matting with Consistent Memory Propagation

HORP: Human-Object Relation Priors Guided HOI Detection

InteractVLM: 3D Interaction Reasoning from 2D Foundational Models

Golden Cudgel Network for Real-Time Semantic Segmentation

Viewpoint Rosetta Stone: Unlocking Unpaired Ego-Exo Videos for View-invariant Representation Learning

Black Hole-Driven Identity Absorbing in Diffusion Models

Toward Robust Neural Reconstruction from Sparse Point Sets

UniPhy: Learning a Unified Constitutive Model for Inverse Physics Simulation

DiffusionSfM: Predicting Structure and Motion via Ray Origin and Endpoint Diffusion

Gyro-based Neural Single Image Deblurring

HSI: A Holistic Style Injector for Arbitrary Style Transfer

Efficient Long Video Tokenization via Coordinate-based Patch Reconstruction

Reconstructing Animals and the Wild

Navigating Image Restoration with VAR’s Distribution Alignment Prior

A General Adaptive Dual-level Weighting Mechanism for Remote Sensing Pansharpening

Controllable Human Image Generation with Personalized Multi-Garments

What’s in the Image? A Deep-Dive into the Vision of Vision Language Models

Just Dance with pi! A Poly-modal Inductor for Weakly-supervised Video Anomaly Detection

Neuron: Learning Context-Aware Evolving Representations for Zero-Shot Skeleton Action Recognition

GIF: Generative Inspiration for Face Recognition at Scale

Pos3R: 6D Pose Estimation for Unseen Objects Made Easy

Two by Two: Learning Multi-Task Pairwise Objects Assembly for Generalizable Robot Manipulation

Co-op: Correspondence-based Novel Object Pose Estimation

Tripartite Weight-Space Ensemble for Few-Shot Class-Incremental Learning

ALIEN: Implicit Neural Representations for Human Motion Prediction under Arbitrary Latency

Seurat: From Moving Points to Depth

Do ImageNet-trained Models Learn Shortcuts? The Impact of Frequency Shortcuts on Generalization

NexusGS: Sparse View Synthesis with Epipolar Depth Priors in 3D Gaussian Splatting

Decentralized Diffusion Models

CheckManual: A New Challenge and Benchmark for Manual-based Appliance Manipulation

Forensic Self-Descriptions Are All You Need for Zero-Shot Detection, Open-Set Source Attribution, and Clustering of AI-generated Images

CARL: A Framework for Equivariant Image Registration

Autoregressive Distillation of Diffusion Transformers

FlexiDiT: Your Diffusion Transformer Can Easily Generate High-Quality Samples with Less Compute

CrossSDF: 3D Reconstruction of Thin Structures From Cross-Sections

POT: Prototypical Optimal Transport for Weakly Supervised Semantic Segmentation

Perceptual Inductive Bias Is What You Need Before Contrastive Learning

Pseudo Visible Feature Fine-Grained Fusion for Thermal Object Detection

TinyFusion: Diffusion Transformers Learned Shallow

NSD-Imagery: A Benchmark Dataset for Extending fMRI Vision Decoding Methods to Mental Imagery

Poly-Autoregressive Prediction for Modeling Interactions

ExpertAF: Expert Actionable Feedback from Video

ForestLPR: LiDAR Place Recognition in Forests Attentioning Multiple BEV Density Images

MI-DETR: An Object Detection Model with Multi-time Inquiries Mechanism

LLAVIDAL: A Large LAnguage VIsion Model for Daily Activities of Living

3D-MVP: 3D Multiview Pretraining for Manipulation

TIDE: Training Locally Interpretable Domain Generalization Models Enables Test-time Correction

InterDyn: Controllable Interactive Dynamics with Video Diffusion Models

Concept Lancet: Image Editing with Compositional Representation Transplant

Zero-Shot Novel View and Depth Synthesis with Multi-View Geometric Diffusion

Instant3dit: Multiview Inpainting for Fast Editing of 3D Objects

DiSciPLE: Learning Interpretable Programs for Scientific Visual Discovery

Preserving Clusters in Prompt Learning for Unsupervised Domain Adaptation

Noise-Resistant Video Anomaly Detection via RGB Error-Guided Multiscale Predictive Coding and Dynamic Memory

Investigating the Role of Weight Decay in Enhancing Nonconvex SGD

MV-MATH: Evaluating Multimodal Math Reasoning in Multi-Visual Contexts

DeSplat: Decomposed Gaussian Splatting for Distractor-Free Rendering

EDCFlow: Exploring Temporally Dense Difference Maps for Event-based Optical Flow Estimation

Dual Prompting Image Restoration with Diffusion Transformers

UVGS: Reimagining Unstructured 3D Gaussian Splatting using UV Mapping

Z-Magic: Zero-shot Multiple Attributes Guided Image Creator

DoraCycle: Domain-Oriented Adaptation of Unified Generative Model in Multimodal Cycles

PhysVLM: Enabling Visual Language Models to Understand Robotic Physical Reachability

MonoTAKD: Teaching Assistant Knowledge Distillation for Monocular 3D Object Detection

Foveated Instance Segmentation

Zero-Shot Head Swapping in Real-World Scenarios

Pow3R: Empowering Unconstrained 3D Reconstruction with Camera and Scene Priors

Sampling Innovation-Based Adaptive Compressive Sensing

Scalable Autoregressive Monocular Depth Estimation

FreeScene: Mixed Graph Diffusion for 3D Scene Synthesis from Free Prompts

MaskGWM: A Generalizable Driving World Model with Video Mask Reconstruction

Dual-Agent Optimization framework for Cross-Domain Few-Shot Segmentation

HSI-GPT: A General-Purpose Large Scene-Motion-Language Model for Human Scene Interaction

HoVLE: Unleashing the Power of Monolithic Vision-Language Models with Holistic Vision-Language Embedding

Link-based Contrastive Learning for One-Shot Unsupervised Domain Adaptation

SAM2-LOVE: Segment Anything Model 2 in Language-aided Audio-Visual Scenes

MC^2: Multi-concept Guidance for Customized Multi-concept Generation

SILMM: Self-Improving Large Multimodal Models for Compositional Text-to-Image Generation

DocSAM: Unified Document Image Segmentation via Query Decomposition and Heterogeneous Mixed Learning

RoboBrain: A Unified Brain Model for Robotic Manipulation from Abstract to Concrete

RayFlow: Instance-Aware Diffusion Acceleration via Adaptive Flow Trajectories

Spatiotemporal Decoupling for Efficient Vision-Based Occupancy Forecasting

Diffusion Model is Effectively Its Own Teacher

AVQACL: A Novel Benchmark for Audio-Visual Question Answering Continual Learning

Science-T2I: Addressing Scientific Illusions in Image Synthesis

EASEMVC: Efficient Dual Selection Mechanism for Deep Multi-View Clustering

StyleSSP: Sampling StartPoint Enhancement for Training-free Diffusion-based Method for Style Transfer

SoftShadow: Leveraging Soft Masks for Penumbra-Aware Shadow Removal

Hearing Anywhere in Any Environment

Parameterized Blur Kernel Prior Learning for Local Motion Deblurring

Driving by the Rules: A Benchmark for Integrating Traffic Sign Regulations into Vectorized HD Map

AeroGen: Enhancing Remote Sensing Object Detection with Diffusion-Driven Data Generation

Blurry-Edges: Photon-Limited Depth Estimation from Defocused Boundaries

IDEA-Bench: How Far are Generative Models from Professional Designing?

Enhancing Dataset Distillation via Non-Critical Region Refinement

Logits DeConfusion with CLIP for Few-Shot Learning

When the Future Becomes the Past: Taming Temporal Correspondence for Self-supervised Video Representation Learning

DEFOM-Stereo: Depth Foundation Model Based Stereo Matching

Minding Fuzzy Regions: A Data-driven Alternating Learning Paradigm for Stable Lesion Segmentation

LumiNet: Latent Intrinsics Meets Diffusion Models for Indoor Scene Relighting

Towards Visual Discrimination and Reasoning of Real-World Physical Dynamics: Physics-Grounded Anomaly Detection

Generative Sparse-View Gaussian Splatting

ProjAttacker: A Configurable Physical Adversarial Attack for Face Recognition via Projector

Reasoning to Attend: Try to Understand How <SEG> Token Works

UniSTD: Towards Unified Spatio-Temporal Learning across Diverse Disciplines

Flow-NeRF: Joint Learning of Geometry, Poses, and Dense Flow within Unified Neural Representations

AMR-Transformer: Enabling Efficient Long-range Interaction for Complex Neural Fluid Simulation

Type-R: Automatically Retouching Typos for Text-to-Image Generation

HoGS: Unified Near and Far Object Reconstruction via Homogeneous Gaussian Splatting

Multi-Label Prototype Visual Spatial Search for Weakly Supervised Semantic Segmentation

MotionBench: Benchmarking and Improving Fine-grained Video Motion Understanding for Vision Language Models

Image Referenced Sketch Colorization Based on Animation Creation Workflow

The Change You Want To Detect: Semantic Change Detection In Earth Observation With Hybrid Data Generation

RoboSense: Large-scale Dataset and Benchmark for Egocentric Robot Perception and Navigation in Crowded and Unstructured Environments

High-Fidelity Relightable Monocular Portrait Animation with Lighting-Controllable Video Diffusion Model

TimeTracker: Event-based Continuous Point Tracking for Video Frame Interpolation with Non-linear Motion

OccMamba: Semantic Occupancy Prediction with State Space Models

Ouroboros3D: Image-to-3D Generation via 3D-aware Recursive Diffusion

Forensics-Bench: A Comprehensive Forgery Detection Benchmark Suite for Large Vision Language Models

ProKeR: A Kernel Perspective on Few-Shot Adaptation of Large Vision-Language Models

One-Way Ticket: Time-Independent Unified Encoder for Distilling Text-to-Image Diffusion Models

Fast and Accurate Gigapixel Pathological Image Classification with Hierarchical Distillation Multi-Instance Learning

Trajectory Mamba: Efficient Attention-Mamba Forecasting Model Based on Selective SSM

DiC: Rethinking Conv3x3 Designs in Diffusion Models

InteractionMap: Improving Online Vectorized HDMap Construction with Interaction

S2Gaussian: Sparse-View Super-Resolution 3D Gaussian Splatting

Dynamic Derivation and Elimination: Audio Visual Segmentation with Enhanced Audio Semantics

Robust Audio-Visual Segmentation via Audio-Guided Visual Convergent Alignment

OmniStyle: Filtering High Quality Style Transfer Data at Scale

Towards Improved Text-Aligned Codebook Learning: Multi-Hierarchical Codebook-Text Alignment with Long Text

StyleMaster: Stylize Your Video with Artistic Generation and Translation

Overcoming Shortcut Problem in VLM for Robust Out-of-Distribution Detection

Divide and Conquer: Heterogeneous Noise Integration for Diffusion-based Adversarial Purification

MonoDGP: Monocular 3D Object Detection with Decoupled-Query and Geometry-Error Priors

PCDreamer: Point Cloud Completion Through Multi-view Diffusion Priors

One-Step Event-Driven High-Speed Autofocus

FIMA-Q: Post-Training Quantization for Vision Transformers by Fisher Information Matrix Approximation

APHQ-ViT: Post-Training Quantization with Average Perturbation Hessian Based Reconstruction for Vision Transformers

Foley-Flow: Coordinated Video-to-Audio Generation with Masked Audio-Visual Alignment and Dynamic Conditional Flows

Effective Cloud Removal for Remote Sensing Images by an Improved Mean-Reverting Denoising Model with Elucidated Design Space

Towards Satellite Image Road Graph Extraction: A Global-Scale Dataset and A Novel Method

DynamicScaler: Seamless and Scalable Video Generation for Panoramic Scenes

VideoGigaGAN: Towards Detail-rich Video Super-Resolution

Learning Visual Composition through Improved Semantic Guidance

High Dynamic Range Video Compression: A Large-Scale Benchmark Dataset and A Learned Bit-depth Scalable Compression Algorithm

Adaptive Parameter Selection for Tuning Vision-Language Models

DL2G: Degradation-guided Local-to-Global Restoration for Eyeglass Reflection Removal

UniGraspTransformer: Simplified Policy Distillation for Scalable Dexterous Robotic Grasping

RestorGS: Depth-aware Gaussian Splatting for Efficient 3D Scene Restoration

ManipTrans: Efficient Dexterous Bimanual Manipulation Transfer via Residual Learning

V-Stylist: Video Stylization via Collaboration and Reflection of MLLM Agents

UNICL-SAM: Uncertainty-Driven In-Context Segmentation with Part Prototype Discovery

Efficient Decoupled Feature 3D Gaussian Splatting via Hierarchical Compression

ControlFace: Harnessing Facial Parametric Control for Face Rigging

FinePhys: Fine-grained Human Action Generation by Explicitly Incorporating Physical Laws for Effective Skeletal Guidance

An End-to-End Robust Point Cloud Semantic Segmentation Network with Single-Step Conditional Diffusion Models

RENO: Real-Time Neural Compression for 3D LiDAR Point Clouds

COB-GS: Clear Object Boundaries in 3DGS Segmentation Based on Boundary-Adaptive Gaussian Splitting

BlenderGym: Benchmarking Foundational Model Systems for Graphics Editing

Antidote: A Unified Framework for Mitigating LVLM Hallucinations in Counterfactual Presupposition and Object Perception

ComRoPE: Scalable and Robust Rotary Position Embedding Parameterized by Trainable Commuting Angle Matrices

MoST: Efficient Monarch Sparse Tuning for 3D Representation Learning

Where the Devil Hides: Deepfake Detectors Can No Longer Be Trusted

From Words to Structured Visuals: A Benchmark and Framework for Text-to-Diagram Generation and Editing

LaVin-DiT: Large Vision Diffusion Transformer

Enhancing Testing-Time Robustness for Trusted Multi-View Classification in the Wild

Accelerating Diffusion Transformer via Increment-Calibrated Caching with Channel-Aware Singular Value Decomposition

GROVE: A Generalized Reward for Learning Open-Vocabulary Physical Skill

Feat2GS: Probing Visual Foundation Models with Gaussian Splatting

STEPS: Sequential Probability Tensor Estimation for Text-to-Image Hard Prompt Search

Identity-Preserving Text-to-Video Generation by Frequency Decomposition

Hearing Hands: Generating Sounds from Physical Interactions in 3D Scenes

Eval3D: Interpretable and Fine-grained Evaluation for 3D Generation

PromptHMR: Promptable Human Mesh Recovery

RLAIF-V: Open-Source AI Feedback Leads to Super GPT-4V Trustworthiness

RoomPainter: View-Integrated Diffusion for Consistent Indoor Scene Texturing

MBQ: Modality-Balanced Quantization for Large Vision-Language Models

VTON 360: High-Fidelity Virtual Try-On from Any Viewing Direction

Associative Transformer

USP-Gaussian: Unifying Spike-based Image Reconstruction, Pose Correction and Gaussian Splatting

Sparse Point Cloud Patches Rendering via Splitting 2D Gaussians

SGCR: Spherical Gaussians for Efficient 3D Curve Reconstruction

EdgeMovingNet: Edge-preserving Point Cloud Reconstruction via Joint Geometry Features

VideoICL: Confidence-based Iterative In-context Learning for Out-of-Distribution Video Understanding

Omnia de EgoTempo: Benchmarking Temporal Understanding of Multi-Modal LLMs in Egocentric Videos

Active Data Curation Effectively Distills Large-Scale Multimodal Models

Attraction Diminishing and Distributing for Few-Shot Class-Incremental Learning

InterMimic: Towards Universal Whole-Body Control for Physics-Based Human-Object Interactions

InterAct: Advancing Large-Scale Versatile 3D Human-Object Interaction Generation

Divot: Diffusion Powers Video Tokenizer for Comprehension and Generation

A Simple Data Augmentation for Feature Distribution Skewed Federated Learning

Track Any Anomalous Object: A Granular Video Anomaly Detection Pipeline

Diffusion Self-Distillation for Zero-Shot Customized Image Generation

Soft Self-labeling and Potts Relaxations for Weakly-supervised Segmentation

Dispider: Enabling Video LLMs with Active Real-Time Interaction via Disentangled Perception, Decision, and Reaction

HiFi-Portrait: Zero-shot Identity-preserved Portrait Generation with High-fidelity Multi-face Fusion

ZeroVO: Visual Odometry with Minimal Assumptions

Parameter-efficient Fine-tuning in Hyperspherical Space for Open-vocabulary Semantic Segmentation

Domain Generalization in CLIP via Learning with Diverse Text Prompts

MambaIC: State Space Models for High-Performance Learned Image Compression

Six-CD: Benchmarking Concept Removals for Text-to-image Diffusion Models

Alias-Free Latent Diffusion Models: Improving Fractional Shift Equivariance of Diffusion Latent Space

Gaussian Splashing: Unified Particles for Versatile Motion Synthesis and Rendering

NoPain: No-box Point Cloud Attack via Optimal Transport Singular Boundary

Multi-focal Conditioned Latent Diffusion for Person Image Synthesis

HVI: A New Color Space for Low-light Image Enhancement

ECVC: Exploiting Non-Local Correlations in Multiple Frames for Contextual Video Compression

Learning Dynamic Collaborative Network for Semi-supervised 3D Vessel Segmentation

Weakly Supervised Contrastive Adversarial Training for Learning Robust Features from Semi-supervised Data

Scene Splatter: Momentum 3D Scene Generation from Single Image with Video Diffusion Model

ICT: Image-Object Cross-Level Trusted Intervention for Mitigating Object Hallucination in Large Vision-Language Models

UMFN: Unified Multi-Domain Face Normalization for Joint Cross-domain Prototype Learning and Heterogeneous Face Recognition

Graph-Embedded Structure-Aware Perceptual Hashing for Neural Network Protection and Piracy Detection

TopNet: Transformer-Efficient Occupancy Prediction Network for Octree-Structured Point Cloud Geometry Compression

DiscoVLA: Discrepancy Reduction in Vision, Language, and Alignment for Parameter-Efficient Video-Text Retrieval

BHViT: Binarized Hybrid Vision Transformer

GaussianFormer-2: Probabilistic Gaussian Superposition for Efficient 3D Occupancy Prediction

Improving Accuracy and Calibration via Differentiated Deep Mutual Learning

BooW-VTON: Boosting In-the-Wild Virtual Try-On via Mask-Free Pseudo Data Training

UniScene: Unified Occupancy-centric Driving Scene Generation

Digital Twin Catalog: A Large-Scale Photorealistic 3D Object Digital Twin Dataset

Visual Persona: Foundation Model for Full-Body Human Customization

Teller: Real-Time Streaming Audio-Driven Portrait Animation with Autoregressive Motion Generation

Multi-Sensor Object Anomaly Detection: Unifying Appearance, Geometry, and Internal Properties

Blood Flow Speed Estimation with Optical Coherence Tomography Angiography Images

SP3D: Boosting Sparsely-Supervised 3D Object Detection via Accurate Cross-Modal Semantic Prompts

NeighborRetr: Balancing Hub Centrality in Cross-Modal Retrieval

EnvPoser: Environment-aware Realistic Human Motion Estimation from Sparse Observations with Uncertainty Modeling

K-LoRA: Unlocking Training-Free Fusion of Any Subject and Style LoRAs

HEIE: MLLM-Based Hierarchical Explainable AIGC Image Implausibility Evaluator

CO-SPY: Combining Semantic and Pixel Features to Detect Synthetic Images by AI

CarPlanner: Consistent Auto-regressive Trajectory Planning for Large-Scale Reinforcement Learning in Autonomous Driving

StreetCrafter: Street View Synthesis with Controllable Video Diffusion Models

FreeTimeGS: Free Gaussian Primitives at Anytime Anywhere for Dynamic Scene Reconstruction

DashGaussian: Optimizing 3D Gaussian Splatting in 200 Seconds

Classic Video Denoising in a Machine Learning World: Robust, Fast, and Controllable

Words or Vision: Do Vision-Language Models Have Blind Faith in Text?

Steepest Descent Density Control for Compact 3D Gaussian Splatting

ScaMo: Exploring the Scaling Law in Autoregressive Motion Generation Model

PoseTraj: Pose-Aware Trajectory Control in Video Diffusion

UCM-VeID V2: A Richer Dataset and A Pre-training Method for UAV Cross-Modality Vehicle Re-Identification

StageDesigner: Artistic Stage Generation for Scenography via Theater Scripts

MaskGaussian: Adaptive 3D Gaussian Representation from Probabilistic Masks

Unboxed: Geometrically and Temporally Consistent Video Outpainting

Less is More: Efficient Model Merging with Binary Task Switch

Wav2Sem: Plug-and-Play Audio Semantic Decoupling for 3D Speech-Driven Facial Animation

Hallo3: Highly Dynamic and Realistic Portrait Image Animation with Video Diffusion Transformer

Forensics Adapter: Adapting CLIP for Generalizable Face Forgery Detection

Unlearning through Knowledge Overwriting: Reversible Federated Unlearning via Selective Sparse Adapter

Audio-Visual Instance Segmentation

Improving the Transferability of Adversarial Attacks on Face Recognition with Diverse Parameters Augmentation

GraphMimic: Graph-to-Graphs Generative Modeling from Videos for Policy Learning

Florence-VL: Enhancing Vision-Language Models with Generative Vision Encoder and Depth-Breadth Fusion

Implicit Correspondence Learning for Image-to-Point Cloud Registration

Visual Lexicon: Rich Image Features in Language Space

nnWNet: Rethinking the Use of Transformers in Biomedical Image Segmentation and Calling for a Unified Evaluation Benchmark

GuardSplat: Efficient and Robust Watermarking for 3D Gaussian Splatting

SynerGen-VL: Towards Synergistic Image Understanding and Generation with Vision Experts and Token Folding

Continual SFT Matches Multimodal RLHF with Negative Supervision

Enhanced Contrastive Learning with Multi-view Longitudinal Data for Chest X-ray Report Generation

Neural Video Compression with Context Modulation

AniMer: Animal Pose and Shape Estimation Using Family Aware Transformer

Learning Class Prototypes for Unified Sparse-Supervised 3D Object Detection

ROCKET-1: Mastering Open-World Interaction with Visual-Temporal Context Prompting

OPTICAL: Leveraging Optimal Transport for Contribution Allocation in Dataset Distillation

Mitigating Hallucinations in Large Vision-Language Models via DPO: On-Policy Data Hold the Key

Cross-Modal 3D Representation with Multi-View Images and Point Clouds

SoftVQ-VAE: Efficient 1-Dimensional Continuous Tokenizer

Video-MME: The First-Ever Comprehensive Evaluation Benchmark of Multi-modal LLMs in Video Analysis

Bridging Modalities: Improving Universal Multimodal Retrieval by Multimodal Large Language Models

GLUS: Global-Local Reasoning Unified into A Single Large Language Model for Video Segmentation

DeformCL: Learning Deformable Centerline Representation for Vessel Extraction in 3D Medical Image

Coarse Correspondences Boost Spatial-Temporal Reasoning in Multimodal Language Model

PrEditor3D: Fast and Precise 3D Shape Editing

Co-Speech Gesture Video Generation with Implicit Motion-Audio Entanglement

VILA-M3: Enhancing Vision-Language Models with Medical Expert Knowledge

Stable-SCore: A Stable Registration-based Framework for 3D Shape Correspondence

Janus: Decoupling Visual Encoding for Unified Multimodal Understanding and Generation

MMVU: Measuring Expert-Level Multi-Discipline Video Understanding

Reward Fine-Tuning Two-Step Diffusion Models via Learning Differentiable Latent-Space Surrogate Reward

FATE: Full-head Gaussian Avatar with Textural Editing from Monocular Video

LMO: Linear Mamba Operator for MRI Reconstruction

Mimir: Improving Video Diffusion Models for Precise Text Understanding

From Multimodal LLMs to Generalist Embodied Agents: Methods and Lessons

Improved Video VAE for Latent Video Diffusion Model

Towards Continual Universal Segmentation

Building a Mind Palace: Structuring Environment-Grounded Semantic Graphs for Effective Long Video Analysis with LLMs

Your Scale Factors are My Weapon: Targeted Bit-Flip Attacks on Vision Transformers via Scale Factor Manipulation

HiPART: Hierarchical Pose AutoRegressive Transformer for Occluded 3D Human Pose Estimation

Adaptive Dropout: Unleashing Dropout across Layers for Generalizable Image Super-Resolution

MIDI: Multi-Instance Diffusion for Single Image to 3D Scene Generation

EasyCraft: A Robust and Efficient Framework for Automatic Avatar Crafting

World-consistent Video Diffusion with Explicit 3D Modeling

CaricatureBooth: Data-Free Interactive Caricature Generation in a Photo Booth

GaussianWorld: Gaussian World Model for Streaming 3D Occupancy Prediction

Generative Gaussian Splatting for Unbounded 3D City Generation

Insight-V: Exploring Long-Chain Visual Reasoning with Multimodal Large Language Models

WildAvatar: Learning In-the-wild 3D Avatars from the Web

SOLAMI: Social Vision-Language-Action Modeling for Immersive Interaction with 3D Autonomous Characters

MAGE: Single Image to Material-Aware 3D via the Multi-View G-Buffer Estimation Model

Decoupled Motion Expression Video Segmentation

Incremental Object Keypoint Learning

Completion as Enhancement: A Degradation-Aware Selective Image Guided Network for Depth Completion

StableAnimator: High-Quality Identity-Preserving Human Image Animation

From Poses to Identity: Training-Free Person Re-Identification via Feature Centralization

Tightening Robustness Verification of MaxPool-based Neural Networks via Minimizing the Over-Approximation Zone

Human Motion Instruction Tuning

MMAR: Towards Lossless Multi-Modal Auto-Regressive Probabilistic Modeling

Boost the Inference with Co-training: A Depth-guided Mutual Learning Framework for Semi-supervised Medical Polyp Segmentation

One-for-More: Continual Diffusion Model for Anomaly Detection

BiM-VFI: Bidirectional Motion Field-Guided Frame Interpolation for Video with Non-uniform Motions

Curriculum Coarse-to-Fine Selection for High-IPC Dataset Distillation

VideoScene: Distilling Video Diffusion Model to Generate 3D Scenes in One Step

FG^2: Fine-Grained Cross-View Localization by Fine-Grained Feature Matching

ProbPose: A Probabilistic Approach to 2D Human Pose Estimation

CountLLM: Towards Generalizable Repetitive Action Counting via Large Language Model

SapiensID: Foundation for Human Recognition

BADGR: Bundle Adjustment Diffusion Conditioned by Gradients for Wide-Baseline Floor Plan Reconstruction

Three Cars Approaching within 100m! Enhancing Distant Geometry by Tri-Axis Voxel Scanning for Camera-based Semantic Scene Completion

MetaShadow: Object-Centered Shadow Detection, Removal, and Synthesis

S4-Driver: Scalable Self-Supervised Driving Multimodal Large Language Model with Spatio-Temporal Visual Representation

SATA: Spatial Autocorrelation Token Analysis for Enhancing the Robustness of Vision Transformers

A Theory of Learning Unified Model via Knowledge Integration from Label Space Varying Domains

HiLoTs: High-Low Temporal Sensitive Representation Learning for Semi-Supervised LiDAR Segmentation in Autonomous Driving

Spiking Transformer with Spatial-Temporal Attention

Perceptual Video Compression with Neural Wrapping

DKDM: Data-Free Knowledge Distillation for Diffusion Models with Any Architecture

Towards More General Video-based Deepfake Detection through Facial Component Guided Adaptation for Foundation Model

WF-VAE: Enhancing Video VAE by Wavelet-Driven Energy Flow for Latent Video Diffusion Model

SymDPO: Boosting In-Context Learning of Large Multimodal Models with Symbol Demonstration Direct Preference Optimization

Data-free Universal Adversarial Perturbation with Pseudo-semantic Prior

Debiasing Multimodal Large Language Models via Noise-Aware Preference Optimization

FRAME: Floor-aligned Representation for Avatar Motion from Egocentric Video

Rethinking Decoder Design: Improving Biomarker Segmentation Using Depth-to-Space Restoration and Residual Linear Attention

MDP: Multidimensional Vision Model Pruning with Latency Constraint

MaDCoW: Marginal Distortion Correction for Wide-Angle Photography with Arbitrary Objects

SynTab-LLaVA: Enhancing Multimodal Table Understanding with Decoupled Synthesis

Edit Away and My Face Will not Stay: Personal Biometric Defense against Malicious Generative Editing

DrVideo: Document Retrieval Based Long Video Understanding

Infighting in the Dark: Multi-Label Backdoor Attack in Federated Learning

Buffer Anytime: Zero-Shot Video Depth and Normal from Image Priors

PSHuman: Photorealistic Single-image 3D Human Reconstruction using Cross-Scale Multiview Diffusion and Explicit Remeshing

Tartan IMU: A Light Foundation Model for Inertial Positioning in Robotics

Event Ellipsometer: Event-based Mueller-Matrix Video Imaging

DocLayLLM: An Efficient Multi-modal Extension of Large Language Models for Text-rich Document Understanding

EDEN: Enhanced Diffusion for High-quality Large-motion Video Frame Interpolation

Handling Spatial-Temporal Data Heterogeneity for Federated Continual Learning via Tail Anchor

End-to-End HOI Reconstruction Transformer with Graph-based Encoding

Hiding Images in Diffusion Models by Editing Learned Score Functions

Disco4D: Disentangled 4D Human Generation and Animation from a Single Image

WeatherGen: A Unified Diverse Weather Generator for LiDAR Point Clouds via Spider Mamba Diffusion

IDOL: Instant Photorealistic 3D Human Creation from a Single Image

SketchVideo: Sketch-based Video Generation and Editing

PhysicsGen: Can Generative Models Learn from Images to Predict Complex Physical Relations?

Taste More, Taste Better: Diverse Data and Strong Model Boost Semi-Supervised Crowd Counting

Spectral Informed Mamba for Robust Point Cloud Processing

Balanced Direction from Multifarious Choices: Arithmetic Meta-Learning for Domain Generalization

Anatomical Consistency and Adaptive Prior-informed Transformation for Multi-contrast MR Image Synthesis via Diffusion Model

PartRM: Modeling Part-Level Dynamics with Large Cross-State Reconstruction Model

SeCap: Self-Calibrating and Adaptive Prompts for Cross-view Person Re-Identification in Aerial-Ground Networks

Leveraging Perturbation Robustness to Enhance Out-of-Distribution Detection

Neural Motion Simulator: Pushing the Limit of World Models in Reinforcement Learning

Adversarial Diffusion Compression for Real-World Image Super-Resolution

BARD-GS: Blur-Aware Reconstruction of Dynamic Scenes via Gaussian Splatting

Towards Universal AI-Generated Image Detection by Variational Information Bottleneck Network

AlphaPre: Amplitude-Phase Disentanglement Model for Precipitation Nowcasting

V2V3D: View-to-View Denoised 3D Reconstruction for Light Field Microscopy

DiN: Diffusion Model for Robust Medical VQA with Semantic Noisy Labels

Splatter-360: Generalizable 360 Gaussian Splatting for Wide-baseline Panoramic Images

ShowMak3r: Compositional TV Show Reconstruction