CVPR
2026
Drainage: A Unifying Framework for Addressing Class Uncertainty
Generative Modeling of Weights: Generalization or Memorization?
Unifying Precise Keyframes and Semantic Control via Multi-level Diffusion
MV2UV: Generating High-quality UV Texture Maps with Multiview Prompts
OmniBrainBench: A Comprehensive Multimodal Benchmark for Brain Imaging Analysis Across Multi-stage Clinical Tasks
MoCha: End-to-End Video Character Replacement without Structural Guidance
Ghosts in the Point Clouds: De-glaring LiDAR in the Transient Domain
Computer Vision with a Superpixelation Camera
Hierarchical Action Learning for Weakly-Supervised Action Segmentation
UTPTrack: Towards Simple and Unified Token Pruning for Visual Tracking
Decoupling Bias, Aligning Distributions: Synergistic Fairness Optimization for Deepfake Detection
HoneyBee: Data Recipes for Vision-Language Reasoners
Dynamic-Static Decomposition for Novel View Synthesis of Dynamic Scenes with Spiking Neurons
GUI-SAGE: Enhancing GUI Automation with Self-Explanatory Learning
ZoomEarth: Active Perception for Ultra-High-Resolution Geospatial Vision-Language Tasks
SALMUBench: A Benchmark for Sensitive Association-Level Multimodal Unlearning
VAD-GS: Visibility-Aware Densification for 3D Gaussian Splatting in Dynamic Urban Scenes
EchoPOSE: 6D Pose Estimation of Sparse Echocardiograms for Left-Ventricular 3D Shape Reconstruction
MASQuant: Modality-Aware Smoothing Quantization for Multimodal Large Language Models
Diff-SemiER: Transparency-Aware Adaptive Fusion Diffusion Model with Generative Prior for Semi-Transparent Eyeglasses Removal
GeoSemba: Reconstructing State Space Model for Cross Paradigm Representation in Medical Image Segmentation
Towards Generalized Representations for Low-Light Understanding: When Signal Constancy Meets Semantic Enrichment
VinQA: Visual Elements Interleaved Long-form Answer Generation for Real-World Multimodal Document QA
Sampling-Aware Quantization for Diffusion Models
ORIC: Benchmarking Object Recognition under Contextual Incongruity in Large Vision-Language Models
SpatiaLQA: A Benchmark for Evaluating Spatial Logical Reasoning in Vision-Language Models
Semantic-Adaptive Diffusion for Dynamic Spatiotemporal Fusion
Block-based Learned Image Compression without Blocking Artifacts
FastEventDGS: Deformable Gaussian Splatting for Fast Dynamic Scenes from a Single Event Camera
Reinforcement-Guided Synthetic Data Generation for Privacy-Sensitive Identity Recognition
Adaptive Auxiliary Prompt Blending for Target-Faithful Diffusion Generation
Feed-Forward One-Shot Animatable Textured Mesh Avatar Reconstruction
OMoBlur: An Object Motion Blur Dataset and Benchmark for Real-World Local Motion Deblurring
Mixture of States: Routing Token-Level Dynamics for Multimodal Generation
OneStory: Coherent Multi-Shot Video Generation with Adaptive Memory
Rethinking MLLM Itself as a Segmenter with a Single Segmentation Token
Soft Modality-Guided Expert Specialization in MoE-VLMs
Building Robust Vision Encoders for Cross-Dataset Evaluation in Immunofluorescent Microscopy
Differentiable Adaptive 4D Structured Illumination for Joint Capture of Shape and Reflectance
Gallant: Voxel Grid-based Humanoid Locomotion and Local-navigation across 3-D Constrained Terrains
LinVideo: A Post-Training Framework towards $\mathcal{O}(n)$ Attention in Efficient Video Generation
FloodDiffusion: Tailored Diffusion Forcing for Streaming Motion Generation
Probabilistic Precipitation Nowcasting with Rectified Flow Transformers
SHAPE: Structure-aware Hierarchical Unsupervised Domain Adaptation with Plausibility Evaluation for Medical Image Segmentation
TUNA: Taming Unified Visual Representations for Native Unified Multimodal Models
CineSRD: Leveraging Visual, Acoustic, and Linguistic Cues for Open-World Visual Media Speaker Diarization
MergeVLA: Cross-Skill Model Merging Toward a Generalist Vision-Language-Action Agent
Spatio-Temporal Difference Guided Motion Deblurring with the Complementary Vision Sensor
The devil is in the details: Enhancing Video Virtual Try-On via Keyframe-Driven Details Injection
Target-Aware Invertible Encoder with Reconstruction Guidance for Infrared Small Target Detection
Probabilistic Prompt Adaptation for Unified Image Aesthetics and Quality Assessment
ELITE: Efficient Gaussian Head Avatar from a Monocular Video via Learned Initialization and Test-time Generative Adaptation
Gloria: Consistent Character Video Generation via Content Anchors
Percept-WAM: Perception-Enhanced World-Awareness-Action Model for Robust End-to-End Autonomous Driving
Ghost-FWL: A Large-Scale Full-Waveform LiDAR Dataset for Ghost Detection and Removal
AdaBet: Gradient-free Layer Selection for Efficient Training of Deep Neural Networks
FastHybrid: Accelerating Hybrid Autoregressive Image Generation with Lookahead and Guided Decoding
RoMo: A Large-Scale, Richly Organized Dataset and Semantic Taxonomy for Human Motion Generation
OpenMMReasoner: Pushing the Frontiers in Multimodal Reasoning with an Open and General Recipe
SPARK: Sim-ready Part-level Articulated Reconstruction with VLM Knowledge
Uncertainty-guided Compositional Alignment with Part-to-Whole Semantic Representativeness in Hyperbolic Vision-Language Models
Beyond What's Shared: Recovering Lost Unique Information from Intermediate Layers to Boost Multimodal Geo-Foundation Models
Unified Camera Positional Encoding for Controlled Video Generation
Revisiting 2D Foundation Models for Scalable 3D Medical Image Classification
Predictive Regularization Against Visual Representation Degradation in Multimodal Large Language Models
FSLoRA: Harmonizing Detection and Re-Identification via Freq-Spatial Low-Rank Adapter for One-Stage Person Search
TransPrune: Token Transition Pruning for Efficient Large Vision-Language Model
Bridge: Basis-Driven Causal Inference Marries VFMs for Domain Generalization
Re-evaluating Continual VQA: Toward Fair and Robust Evaluation for Multimodal Continual Learning
TV2TV: A Unified Framework for Interleaved Language and Video Generation
Inference-time Physics Alignment of Video Generative Models with Latent World Models
ReFAct: Empowering Multimodal Web Agents with Visual and Context Focusing
GUIDE: A Benchmark for Understanding and Assisting Users in Open-Ended GUI Tasks
GVIS: Generative Vector Image Steganography
GDFA: Geometry-Driven Federated Unlearning with Directional Task Vector Alignment
MSPT: Efficient Large-Scale Physical Modeling via Parallelized Multi-Scale Attention
Stability-Driven Motion Generation for Object-Guided Human-Human Co-Manipulation
FMPose: 3D Pose Estimation via Flow Matching
Let Your Image Move with Your Motion! -- Implicit Multi-Object Multi-Motion Transfer
QuCNet: Quantum Deep Learning Driven Multi-Circuit Network for Remote Sensing Image Classification
Attend Before Attention: Efficient and Scalable Video Understanding via Autoregressive Gazing
Saliency-Driven Token Merging for Vision Transformers
GeoCoT: Towards Reliable Remote Sensing Reasoning with Manifold Perspective
Volumetric Functional Maps
CC-VQA: Conflict- and Correlation-Aware Method for Mitigating Knowledge Conflict in Knowledge-Based Visual Question Answering
MUFASA: A Multi-Layer Framework for Slot Attention
Enhancing Unregistered Hyperspectral Image Super-Resolution via Unmixing-based Abundance Fusion Learning
Token Reduction via Local and Global Contexts Optimization for Efficient Video Large Language Models
Continual Learning by Reuse, New, Adapt and Skip: A Hierarchical Exploration-Exploitation Approach
Fast Markov Random Field Optimisation for Topologically Noisy 3D Shape Matching
Unsupervised Multi-Scale Segmentation of 3D Subcellular World with Stable Diffusion Foundation Model
MMGait: Towards Multi-Modal Gait Recognition
Learning Diffeomorphism for Medical Image Registration with Time-Embedded Architectures Using Semigroup Regularization
R$^2$TUA: Reconstruction-residual Based Targeted and Untargeted Attack Against Text-Image Person Re-Identification
MV-TAP: Tracking Any Point in Multi-View Videos
G$^2$VLM: Geometry Grounded Vision Language Model with Unified 3D Reconstruction and Spatial Reasoning
M3Grounder: Mask-Based Multi-Span and Multi-Granular Grounding for Document QA
Multi-level Causal LLM-based Text-to-Motion Generation with Human Alignment
ERMoE: Eigen-Reparameterized Mixture-of-Experts for Stable Routing and Interpretable Specialization
Zoo3D: Zero-Shot 3D Object Detection at Scene Level
VIAFormer: Voxel-Image Alignment Transformer for High-Fidelity Voxel Refinement
Towards Multimodal Domain Generalization with Few Labels
Splatent: Splatting Diffusion Latents for Novel View Synthesis
HAVE-Bench: Hierarchical Audio-Visual Evaluation from Perception to Interaction
Advancing Image Classification with Discrete Diffusion Classification Modeling
PARSE: Part-Aware Relational Spatial Modeling
RI-Mamba: Rotation-Invariant Mamba for Robust Text-to-Shape Retrieval
Drift-Resilient Temporal Priors for Visual Tracking
Refer-Agent: A Collaborative Multi-Agent System for Referring Video Object Segmentation with Reasoning and Reflection
Nestwork: Conditional 3D Furnished House Layout Generation through Latent Heterogeneous Graph Diffusion
Long-RVOS: A Comprehensive Benchmark for Long-term Referring Video Object Segmentation
HAD: Heterogeneity-Aware Distillation for Lifelong Heterogeneous Learning
Faithful Contouring: Near-Lossless 3D Voxel Representation Free from Iso-Surface
First Frame Is the Place to Go for Video Content Customization
Similarity-Consistent Likelihood Diffusion enables Hidden Person Detection from Wall Reflections
CodePercept: Code-Grounded Visual STEM Perception for MLLM
Guiding a Diffusion Model by Swapping Its Tokens
Arcadia: Toward a Full-Lifecycle Framework for Embodied Lifelong Learning
F$^2$-Assist: Multi-Phase Fetal Growth Forecast and Report Generation from Ultrasound Examination
FEAST: Fully Connected Expressive Attention for Spatial Transcriptomics
Hearing the Room Through the Shape of the Drum: Modal-Guided Sound Recovery from Multi-Point Surface Vibrations
TEAR: Temporal-aware Automated Red-teaming for Text-to-Video Models
OmniVTG: A Large-Scale Dataset and Training Paradigm for Open-World Video Temporal Grounding
Robustness Under Data Scarcity: Few-Shot Continual Adversarial Training for Evolving Threats
GeoFlow: Real-Time Fine-Grained Cross-View Geolocalization via Iterative Flow Prediction
The Devil is in Attention Sharing: Improving Complex Non-rigid Image Editing Faithfulness via Attention Synergy
Visual Diffusion Models are Geometric Solvers
One Algorithm to Align Them All
Can Natural Image Autoencoders Compactly Tokenize fMRI Volumes for Long-Range Dynamics Modeling?
OmniGround: A Comprehensive Spatio-Temporal Grounding Benchmark for Real-World Complex Scenarios
ReFTA: Breaking the Weight Reconstruction Bottleneck in Tensorized Parameter-Efficient Fine-Tuning
On Token's Dilemma: Dynamic MoE with Drift-Aware Token Assignment for Continual Learning of Large Vision Language Models
SARL-STG: A Spatially Aware Reinforcement Learning Framework for Refining MLLMs in Spatio-Temporal Video Grounding
M$^3$KG-RAG: Multi-hop Multimodal Knowledge Graph-enhanced Retrieval-Augmented Generation
On the Role of Temporal Granularity in the Robustness of Spiking Neural Networks
Robust Spiking Neural Networks by Temporal Mutual Information
Preserving Source Video Realism: High-Fidelity Face Swapping for Cinematic Quality
GenHOI: Towards Object-Consistent Hand–Object Interaction with Temporally Balanced and Spatially Selective Object Injection
Long-SCOPE: Fully Sparse Long-Range Cooperative 3D Perception
FM-Steer: Enhance Generalist Policies with Value-Guided Cascaded Denoising
CryoHype: Reconstructing a thousand cryo-EM structures with transformer-based hypernetworks
End-to-End Language-Action Model for Humanoid Whole Body Control
GazeOnce360: Fisheye-Based 360° Multi-Person Gaze Estimation with Global–Local Feature Fusion
M3DLayout: A Multi-Source Dataset of 3D Indoor Layouts and Structured Descriptions for 3D Generation
Vision-Language Attribute Disentanglement and Reinforcement for Lifelong Person Re-Identification
RAMEN: Resolution-Adjustable Multimodal Encoder for Earth Observation
Chorus: Multi-Teacher Pretraining for Holistic 3D Gaussian Scene Encoding
Multi-modal Test-time adaptation via Adaptive Probabilistic Gaussian Calibration
Test-Time Perturbation Tuning with Delayed Feedback for Vision-Language-Action Models
UZ3DVG: Unaided Zero-Shot 3D Visual Grounding with Generated Language Conditions
VVS: Accelerating Speculative Decoding for Visual Autoregressive Model via Partial Verification Skipping
Test-time Sparsity for Extreme Fast Action Diffusion
When LoRA Betrays: Backdooring Text-to-Image Models by Masquerading as Benign Adapters
UniVerse: Empower Unified Generation with Reasoning and Knowledge
ShapeAR: Generating Editable Shape Layers via Autoregressive Diffusion
VectorArk: Learning Practical Image Vectorization with Rounded Polygon Representation
PAMotion: Physics-Aware Motion Generation for Full-Body Interaction with Multiple Objects
Copy-Transform-Paste: Zero-Shot Object-Object Alignment Guided by Vision-Language and Geometric Constraints
Solvability of the Viewing Graph Under the Affine Camera Model
Parallel Jacobi Decoding for Fast Autoregressive Image Generation
AMap: Distilling Future Priors for Ahead-Aware Online HD Map Construction
LiteVGGT: Boosting Vanilla VGGT via Geometry-aware Cached Token Merging
Granulon: Awakening Pixel-Level Visual Encoders with Adaptive Multi-Granularity Semantics for MLLM
Towards Stable Self-Supervised Object Representations in Unconstrained Egocentric Video
Keep It Frozen: Domain-Routed Conditional Residual Modulation for Multi-Domain Vision Transformers
WeatherDiffusion: Controllable Weather Editing in Intrinsic Space
Cross-Modal Attention Calibration for LVLM Hallucination Mitigation
A Cross-view Fusion Framework for Robust 6-DoF Grasp Pose Estimation
Distributed Image Compression with Multimodal Side Information at Extremely Low Bitrates
RaGS: Unleashing 3D Gaussian Splatting from 4D Radar and Monocular Cue for 3D Object Detection
WebChain: A Large-Scale Human-Annotated Dataset of Real-World Web Interaction Traces
AKCMamba-YOLO: Selective State Space Models For Real-Time Object Detection
Seeing Through Blur: Tackling Defocus in Spike-Based Imaging
Progressive Cross-Modal Causal Intervention for Long-Term Action Recognition
Subspace Alignment for CLIP-based Continual Learning via Canonical Correlation Analysis
Learning What Helps: Task-Aligned Context Selection for Vision Tasks
Learning Distribution-wise Foundation Prior Consistency and Instance-wise Style Calibration for Medical Image Generalization
Localizing, Structuring, and Rendering: Bridging 3D and 2D Vision-Language-Action Models for Robotic Manipulation
Physically Ground Commonsense Knowledge for Articulated Object Manipulation with Analytic Concepts
H2-Surv: Hierarchical Hyperbolic Multimodal Representation Learning for Survival Prediction
R$^2$-Seg: Training-Free OOD Medical Tumor Segmentation via Anatomical Reasoning and Statistical Rejection
PhysGS: Bayesian-Inferred Gaussian Splatting for Physical Property Estimation
CAD-Refiner: A Unified Framework for CAD Generation and Iterative Editing
Fighting Hallucinations with Counterfactuals: Diffusion-Guided Perturbations for LVLM Hallucination Suppression
SEBA: Sample-Efficient Black-Box Attacks on Visual Reinforcement Learning
CROWn: A Unified Framework for Anti‑Aliased Downsampling and Phase‑Calibrated Fusion in 3D Medical Segmentation
ResDiT: Evoking the Intrinsic Resolution Scalability in Diffusion Transformers
TriDF: Evaluating Perception, Detection, and Hallucination for Interpretable DeepFake Detection
OVOD-Agent: A Markov–Bandit Framework for Proactive Visual Reasoning and Self-Evolving Detection
UniSpector: Towards Universal Open-set Defect Recognition via Spectral-Contrastive Visual Prompting
MajutsuCity: Language-driven Aesthetic-adaptive City Generation with Controllable 3D Assets and Layouts
Multi-Modal Image Fusion via Intervention-Stable Feature Learning
Rethinking Occlusion Modeling for UAV Tracking
Style-GRPO: Semantic-Aware Preference Optimization for Image Style Transfer Guided by Reward Modeling
Advancing Cancer Prognosis with Hierarchical Fusion of Genomic, Proteomic and Pathology Imaging Data from a Systems Biology Perspective
ManifoldNeuS: Manifold-aware View Optimizability for Pose-Free Neural Surface Reconstruction
PointTPA: Test-Time Parameter Adaptation for 3D Scene Understanding
IEBGL: An Interpretability-Enhanced Brain Graph Learning Framework with LLM-Instructed Topology and Literature-Augmented Semantics
GSNR: Graph Smooth Null-Space Representation for Inverse Problems
Evolutionary Multimodal Reasoning via Hierarchical Semantic Representation for Intent Recognition
Diversity over Uniformity: Rethinking Representation in Generated Image Detection
ReDirector: Creating Any-Length Video Retakes with Rotary Camera Encoding
InTrain: Intrinsic Trainability for Zero-Cost Neural Architecture Search
SpaceDrive: Infusing Spatial Awareness into VLM-based Autonomous Driving
TAPE: Task-Adaptive Prototype Evolution in Audio-Language Models for Fully Few-shot Class-incremental Audio Classification
UniCompress: Token Compression for Unified Vision–Language Understanding and Generation
HyCal: A Training-Free Prototype Calibration Method for Cross-Discipline Few-Shot Class-Incremental Learning
Beyond Weak Supervision: MLLMs-Guided Graded Knowledge Distillation for Unsupervised Camouflaged Object Detection
FlowComposer: Composable Flows for Compositional Zero-Shot Learning
Photo3D: Advancing Photorealistic 3D Generation through Structure‑Aligned Detail Enhancement
Scaling Zero-Shot Reference-to-Video Generation
TempoControl: Temporal Attention Guidance for Text-to-Video Models
SEA-Vision: A Multilingual Benchmark for Comprehensive Document and Scene Text Understanding in Southeast Asia
Prune Wisely, Reconstruct Sharply: Compact 3D Gaussian Splatting via Adaptive Pruning and Difference-of-Gaussian Primitives
URICA: A Uniformity Region Affine Identifier Capture Algorithm for Arbitrary Region Retrieval in Pathology Images
KAMP: Knowledge-Anchored Multimodal Pretraining Framework for Medical Image Representation
Harnessing Chain-of-Thought Reasoning in Multimodal Large Language Models for Face Anti-Spoofing
Assignment-Driven Hash Learning in a Hyper-Semantic Space for On-the-Fly Category Discovery
Reading or Reasoning? Format Decoupled Reinforcement Learning for Document OCR
IDperturb: Enhancing Variation in Synthetic Face Generation via Angular Perturbations
Few-shot Acoustic Synthesis with Multimodal Flow Matching
Scone: Bridging Composition and Distinction in Subject-Driven Image Generation via Unified Understanding-Generation Modeling
When Pretty Isn’t Useful: Investigating Why Modern Text-to-Image Models Fail as Reliable Training Data Generators
VABench: A Comprehensive Benchmark for Audio-Video Generation
TaskIT: Memory-Efficient Fine-Tuning of Multi-LoRA LLMs via Cross-Task Importance Transfer
Visual Prototype Conditioned Focal Region Generation for UAV-Based Object Detection
TR2M: Transferring Monocular Relative Depth to Metric Depth with Language Descriptions and Dual-Level Scale-Oriented Contrast
Are We Ready for RL in Text-to-3D Generation? A Progressive Investigation
Towards Fine-Grained Attribution: Instance-Aware Preference Optimization for Aligning Diffusion Models
UniRain: Unified Image Deraining with RAG-based Dataset Distillation and Multi-objective Reweighted Optimization
RealUnify: Do Unified Models Truly Benefit from Unification? A Comprehensive Benchmark
MDCS-MoAME: Multi-directional Composite Scanning with Mixture of Attention and Mamba Experts for Cancer Survival Prediction
Think-Then-Generate: Structural Chain-of-Thought Reasoning for Consistent 3D Generation
Human-like Abstract Visual Reasoning via Understanding and Solving Reasoning Loop
Attribute-Preserving Pseudo-Labeling for Diffusion-Based Face Swapping
Spectral Conformal Risk Control: Distribution-Free Tail Guarantees via Bayesian Quadrature
Conan: Progressive Learning to Reason Like a Detective over Multi-Scale Visual Evidence
Structure-Aware Representation Distillation for Tiny-Dense Object Segmentation
Unified Video Editing as Temporal Reasoner
MOGeo: Beyond One-to-One Cross-View Object Geo-localization
Unlearning without Forgetting: Securely Removing Targeted Concepts from Large-Scale Vision-Language Open-Vocabulary Detectors
BriMA: Bridged Modality Adaptation for Multi-Modal Continual Action Quality Assessment
LLM-Guided Probabilistic Fusion for Label-Efficient Document Layout Analysis
DAGE: Dual-Stream Architecture for Efficient and Fine-Grained Geometry Estimation
Geometry-aware Cross-modal Graph Alignment for Referring Segmentation in 3D Gaussian Splatting
VideoWeaver: Multimodal Multi-View Video-to-Video Transfer for Embodied Agents
Pano3DComposer: Feed-Forward Compositional 3D Scene Generation from Single Panoramic Image
PROMO: Promptable Outfitting for Efficient High-Fidelity Virtual Try-On
RMAE-ProGRess: Advancing Semantic Segmentation in Unstructured Environments
ShapeR: Robust Conditional 3D Shape Generation from Casual Captures
JRM: Joint Reconstruction Model for Multiple Objects without Alignment
Virtual Nodes Guided Dynamic Graph Neural Network for Brain Tumor Segmentation with Missing Modalities
Beyond the Global Scores: Fine-Grained Token Grounding as a Robust Detector of LVLM Hallucinations
CARE: A Molecular-Guided Foundation Model with Adaptive Region Modeling for Whole Slide Image Analysis
Unlocking Positive Transfer in Incrementally Learning Surgical Instruments: A Self-reflection Hierarchical Prompt Framework
Align Once to Explain: Feature Alignment for Scalable B-cosification of Foundational Vision Transformers
D2Cache: Second-Order Delta Caching for Higher Video Diffusion Acceleration
Affordance Field Intervention: Enabling VLAs to Escape Memory Traps in Robotic Manipulation
Few-for-Many Personalized Federated Learning
Design Your Ad: Personalized Advertising Image and Text Generation with Unified Autoregressive Models
FPSBench: A Benchmark for Video Understanding at High Frame Rates
OmniGen2: Towards Instruction-Aligned Multimodal Generation
Modeling the Visual Ambiguity of Human Sketches
Universal-to-Specific: Dynamic Knowledge-Guided Multiple Instance Learning for Few-Shot Whole Slide Image Classification
GaussianMatch: Semi-Supervised Regression with Pseudo-Label Filtering via Multi-View Gaussian Consistency
DiffBMP: Differentiable Rendering with Bitmap Primitives
ACE-Merging: Data-Free Model Merging with Adaptive Covariance Estimation
Learning Spatial-Temporal Consistency for 3D Semantic Scene Completion
LongVT: Incentivizing "Thinking with Long Videos" via Native Tool Calling
CrossEarth-Gate: Fisher-Guided Adaptive Tuning Engine for Efficient Adaptation of Cross-Domain Remote Sensing Semantic Segmentation
Foca-VLA: Unleashing Hybrid Force-Position Control with Force Awareness for Contact-Rich Manipulation
FINE: Factorizing Knowledge for Initialization of Variable-sized Diffusion Models
Synergistic Bleeding Region and Point Detection in Laparoscopic Surgical Videos
Tackling Alignment Ambiguity in Person Retrieval through Conversational Attribute Mining
Continuous Exposure-Time Modeling for Realistic Atmospheric Turbulence Synthesis
Elucidating the SNR-t Bias of Diffusion Probabilistic Models
PSR: Scaling Multi-Subject Personalized Image Generation with Pairwise Subject-Consistency Rewards
Circuit Mechanisms for Spatial Relation Generation in Diffusion Transformers
Anomaly-Related Residual Fields for Cross-domain Anomaly Detection
Do You Have Freestyle? Expressive Humanoid Locomotion via Audio Control
ActiveVLA: Injecting Active Perception into Vision-Language-Action Models for Precise 3D Robotic Manipulation
SaPaVe: Towards Active Perception and Manipulation in Vision-Language Action Models for Robot
SOUPLE: Enhancing Audio-Visual Localization and Segmentation with Learnable Prompt Contexts
General Process Reward Modeling for Robotic Reinforcement Learning
TraceGen: World Modeling in 3D Trace Space Enables Learning from Cross-Embodiment Videos
MU-GeNeRF: Multi-view Uncertainty-guided Generalizable Neural Radiance Fields for Distractor-aware Scene
Thinking with Video: Video Generation as a Promising Multimodal Reasoning Paradigm
Towards Knowledge-augmented Bayesian Deep Learning For Computer Vision
Action-Sketcher: From Reasoning to Action via Visual Sketches for Robotic Manipulation
CGU-Bayes: Causal Graph Uncertainty-Guided Bayesian Inference for Domain Generalization
Scaling Up AI-Generated Image Detection with Generator-Aware Prototypes
Where Does Vision Meet Language? Understanding and Refining Visual Fusion in MLLMs via Contrastive Attention
PrivateEyes: Gaze-Preserving Anonymization for Data Sharing
SpatialVID: A Large-Scale Video Dataset with Spatial Annotations
Foundation Model Priors Enhance Object Focus in Feature Space for Source-Free Object Detection
NimbusGS: Unified 3D Scene Reconstruction under Hybrid Weather
ChronoGS: Disentangling Invariants and Changes in Multi-Period Scenes
PoseMaster: A Unified 3D Native Framework for Stylized Pose Generation
ArtLLM: Generating Articulated Assets via 3D LLM
Dataset Distillation via Influence Matching
EarlyTom: Early Token Compression Completes Fast Video Understanding
Flat-Pack Bench: Evaluating Spatio-Temporal Understanding in Large Vision-Language Models through Furniture Assembly
UniTEX: Universal High Fidelity Generative Texturing for 3D Shapes
DVAR: Dynamic Visual Autoregressive Modeling for Image Super-Resolution
Act Like a Pathologist: Tissue-Aware Whole Slide Image Reasoning
SafeGRPO: Self-Rewarded Multimodal Safety Alignment via Rule-Governed Policy Optimization
CoIn3D: Revisiting Configuration-Invariant Multi-Camera 3D Object Detection
Simple Agents Outperform Experts in Biomedical Imaging Workflow Optimization
NAMI: Efficient Image Generation via Bridged Progressive Rectified Flow Transformers
Dehallu3D: Hallucination-Mitigated 3D Generation from a Single Image via Cyclic View Consistency Refinement
Visual Grounding for Object Questions
Paper2Figure: A Multi-Agent Collaborative System for Figure Generation Towards Academic Research Paper
Frame2Freq: Spectral Adapters for Fine-Grained Video Understanding
Medic-AD: Towards Medical Vision-Language Model's Clinical Intelligence
Thinking-while-Generating: Interleaving Textual Reasoning throughout Visual Generation
ProgressiveAvatars: Progressive Animatable 3D Gaussian Avatars
Dynamic-eDiTor: Training-Free Text-Driven 4D Scene Editing with Multimodal Diffusion Transformer
Visual Personalization Turing Test
SemanticVLA: Towards Semantic Reasoning over Action Memorization via Synergistic Explicit Trace and Latent Action Planning
RegFormer: Transferable Relational Grounding for Efficient Weakly-Supervised Human-Object Interaction Detection
VLM-Loc: Localization in Point Cloud Maps via Vision-Language Models
Superman: Unifying Skeleton and Vision for Human Motion Perception and Generation
See What We Cannot See: A Geo-guided Reasoning Benchmark for Object Counting under Adverse Earth Observation Conditions
Rethinking Position Embedding as a Context Controller for Multi-Reference and Multi-Shot Video Generation
The Invisible Gorilla Effect in Out-of-distribution Detection
RegionRoute: Regional Style Transfer with Diffusion Model
DualPrim: Compact 3D Reconstruction with Positive and Negative Primitives
Towards Reasoning-Preserving Unlearning in Multimodal Large Language Models
BoostSLT: Boosting Sign Language Translation via a Plug-and-Play Diffusion-Based Semantic Enhancer
PCA-Seg: Revisiting Cost Aggregation for Open-Vocabulary Semantic and Part Segmentation
OmniFM: Toward Modality-Robust and Task-Agnostic Federated Learning for Heterogeneous Medical Imaging
F2Net: A Frequency-Fused Network for Ultra-High Resolution Remote Sensing Segmentation
PhenoYieldNet: Learning Crop-Aware Phenological Responses for Multi-Crop Yield Prediction
FLOW: Optimal Transport-Driven Feature Warping for Generalized Remote Physiological Measurement
PHASE-Net: Physics-Grounded Harmonic Attention System for Efficient Remote Photoplethysmography Measurement
Multi-view Crowd Tracking Transformer with View-Ground Interactions Under Large Real-World Scenes
D2T2 - Multimodal Automated Planning for Brachytherapy
MedLoc-R1: Performance-Aware Curriculum Reward Scheduling for GRPO-Based Medical Visual Grounding
CME-CAD: Heterogeneous Collaborative Multi-Expert Reinforcement Learning for CAD Code Generation
Omni-MMSI: Toward Identity-attributed Social Interaction Understanding
Aligning Text, Images and 3D Structure Token-by-Token
Fine-Grained Multi Image Object Hallucination Benchmark
GUI-CEval: A Hierarchical and Comprehensive Chinese Benchmark for Mobile GUI Agents
Archon: A Unified Multimodal Model for Holistic Digital Human Generation
Toward Diffusible High-Dimensional Latent Spaces: A Frequency Perspective
HeroGS: Hierarchical Guidance for Robust 3D Gaussian Splatting under Sparse Views
Robo-SGG: Exploiting Layout-Oriented Normalization and Restitution Can Improve Robust Scene Graph Generation
StableMaterials: Enhancing Diversity in Material Generation via Semi-Supervised Learning
Scene-VLM: Multimodal Video Scene Segmentation via Vision-Language Models
RemedyGS: Defend 3D Gaussian Splatting Against Computation Cost Attacks
Solving a Nonlinear Blind Inverse Problem for Tagged MRI with Physics and Deep Generative Priors
Hyperbolic Prototype Learning with Uncertainty-Aware Consistency for Continual Test-Time Segmentation
Radar-Guided Polynomial Fitting for Metric Depth Estimation
VideoNet: A Large-Scale Dataset for Domain-Specific Action Recognition
Neural Distribution Prior for LiDAR Out-of-Distribution Detection
SounDiT: Geo-Contextual Soundscape-to-Landscape Generation
Data-Centric Meta-Learning for Robust Few-Shot Generalization
IMAIA: Interactive Maps AI Assistant for Travel Planning and Geo-Spatial Intelligence
NitroGen: An Open Foundation Model for Generalist Gaming Agents
Translating Signals to Languages for sEMG-Based Activity Recognition
SIR: Structured Image Representations for Explainable Robot Learning
Hear What You See: Video-to-Audio Generation with Diffusion Transformer and Semantic-Temporal Alignment-Ranked Direct Preference Optimization
AdaPrior: Bayesian-Inspired Adaptive Prior Correction for Long-Tailed Continual Learning
GeoWorld: Geometric World Models
Conflict-Aware Adaptive Cross-Reconstruction for Multimodal Sentiment Analysis
TimeBridge: Self-Supervised Video Representation Learning via Start-End Joint Embedding and In-Between Frame Prediction
Decoupling Defense Strategies for Robust Image Watermarking
Protego: User-Centric Pose-Invariant Privacy Protection Against Face Recognition-Induced Digital Footprint Exposure
Zero-shot Detection of AI-Generated Image via RAW-RGB Alignment
Bridging the Perception Gap in Image Super-Resolution Evaluation
UDAPose: Unsupervised Domain Adaptation for Low-Light Human Pose Estimation
Semantic-Guided Global-Local Collaborative Prompt Learning for Few-Shot Class Incremental Learning
Edit-As-Act: Goal-Regressive Planning for Open-Vocabulary 3D Indoor Scene Editing
When Visualizing is the First Step to Reasoning: MIRA, a Benchmark for Visual Chain-of-Thought
Wavelet-Driven 3D Anomaly Detection under Pose-Agnostic and Sparse-View
Dark3R: Learning Structure from Motion in the Dark
Spherical Voronoi: Directional Appearance as a Differentiable Partition of the Sphere
PointNSP: Autoregressive 3D Point Cloud Generation with Next-Scale Level-of-Detail Prediction
VIMCAN: Visual-Inertial 3D Human Pose Estimation with Hybrid Mamba-Cross-Attention Network
TAG-MoE: Task-Aware Gating for Unified Generative Mixture-of-Experts
Energy Waveify and Redistribution for Test-Time Adaptation: A Control System Perspective
PromptEnhancer: Taming Your Rewriter for Text-to-Image Generation via Fine-Grained Reward
Contrastive Cross-Bag Augmentation for Multiple Instance Learning-based Whole Slide Image Classification
HybridDriveVLA: Vision-Language-Action model with Visual CoT reasoning and ToT Evaluation for Autonomous Driving
Unleashing the Intrinsic Visual Representation Capability of Multimodal Large Language Models
Select Less, Reason More: Prioritizing Evidence Purity for Video Reasoning
HCL-FF: Hierarchical and Contrastive Learning for Forward-Forward Algorithm
RealVLG-R1: A Large-Scale Real-World Visual-Language Grounding Benchmark for Robotic Perception and Manipulation
DriveVLN: Towards Mapless Vision-and-Language Navigation in Autonomous Driving
P2GS: Physical Prior-guided Gaussian Splatting for Photometrically Consistent Urban Reconstruction
LaS-Comp: Zero-shot 3D Completion with Latent–Spatial Consistency
LaSM: Layer-wise Scaling Mechanism for Defending Pop-up Attack on GUI Agents
RL-ScanIQA: Reinforcement-Learned Scanpaths for Blind 360° Image Quality Assessment
Pose-Free Omnidirectional Gaussian Splatting for 360-Degree Videos with Consistent Depth Priors
Unifying Language-Action Understanding and Generation for Autonomous Driving
HandVQA: Diagnosing and Improving Fine-Grained Spatial Reasoning about Hands in Vision-Language Models
Focus on Background: Exploring SAM's Potential in Few-Shot Medical Image Segmentation with Background-Centric Prompting
Chain of World: World Model Thinking in Latent Motion
MedLIME: A Distribution-Aligned and Evidence-Supported Framework for Medical Saliency Explanations
BEV-CAR: Enhancing Monocular Bird’s Eye View Segmentation with Context-Aware Rasterization
Vista4D: Video Reshooting with 4D Point Clouds
NeuroFlow: Toward Unified Visual Encoding and Decoding from Neural Activity
Good Can Sometimes be Bad: A Unified Attack against 3D Point Cloud Classifier by a Flexible Isotropic Resampling
SHands: A Multi-View Dataset and Benchmark for Surgical Hand-Gesture and Error Recognition Toward Medical Training
PlanaReLoc: Camera Relocalization in 3D Planar Primitives via Region-based Structure Matching
DPL: Decoupled Prototype Learning for Enhancing Robustness of Vision–Language Transformers to Missing Modalities
Accelerating Diffusion-based Video Editing via Heterogeneous Caching: Beyond Full Computing at Sampled Denoising Timestep
The Power of Decaying Steps: Enhancing Attack Stability and Transferability for Sign-based Optimizers
Customized Fusion: A Closed-Loop Dynamic Network for Adaptive Multi-Task-Aware Infrared-Visible Image Fusion
Changes in Real Time: Online Scene Change Detection with Multi-View Fusion
OrienPose: Orientation-Guided Novel View Synthesis for Single-Image Unseen Object Pose Estimation
Towards High-resolution and Disentangled Reference-based Sketch Colorization
Test-Time 3D Occupancy Prediction
ViStoryBench: Comprehensive Benchmark Suite for Story Visualization
Urban-GS: A Unified 3D Gaussian Splatting Framework for Compact and High-Fidelity Aerial-to-Street Reconstruction
Revisiting the Necessity of Lengthy Chain-of-Thought in Vision-centric Reasoning Generalization
DeAR: Fine-Grained VLM Adaptation by Decomposing Attention Head Roles
Prompt-Free Universal Region Proposal Network
InternData-A1: Pioneering High-Fidelity Synthetic Data for Pre-training Generalist Policy
Taming Generative Diffusion Model for Task-Oriented Infrared Imaging
GeoGuide: Hierarchical Geometric Guidance for Open-Vocabulary 3D Semantic Segmentation
Mixture-of-Experts based Feature Decoupling for Open Vocabulary Scene Graph Generation
Image-Based Outlier Synthesis With Training Data
TTRV: Test-Time Reinforcement Learning for Vision Language Models
Spe-BEVHead: Rethinking the Detection Head Design for Bird’s-Eye-View Object Detection
Exploring Spatiotemporal Feature Propagation for Video-Level Compressive Spectral Reconstruction: Dataset, Model and Benchmark
OraPO: Oracle-educated Reinforcement Learning for Data-efficient and Factual Radiology Report Generation
Beyond explicit language: plug-and-play visual-to-Linguistic modeling towards general object tracking
DynamicGTR: Leveraging Graph Topology Representation Preferences to Boost VLM Capabilities on Graph QAs
PaQ-DETR: Learning Pattern and Quality-Aware Dynamic Queries for Object Detection
Beyond Multiple Choice: Verifiable OpenQA for Robust Vision-Language RFT
RealAppiance: Let High-fidelity Appliance Assets Controllable and Workable as Aligned Real Manuals
Residual Connections Harm Self-Supervised Abstract Feature Learning
Repurposing 3D Generative Model for Autoregressive Layout Generation
Stereo World Model
RLFTSim: Realistic and Controllable Multi-Agent Traffic Simulation via Reinforcement Learning Fine-Tuning
MM-ACT: Learn from Multimodal Parallel Generation to Act
Time Without Time: Pseudo-Temporal Representation for Space-Time Super-Resolution
Forging a Dynamic Memory: Retrieval-Guided Continual Learning for Generalist Medical Foundation Models
Beyond Pixel Simulation: Pathology Image Generation via Diagnostic Semantic Tokens and Prototype Control
ExPose: Reinforcing Video Generation Models for Extreme Pose Estimation
EMMA: Extracting Multiple physical parameters from Multimodal Data
FAVE: A Structured Benchmark for Fine-Grained Audio-Visual Temporal Evaluation in Multimodal LLMs
FairLLaVA: Fairness-Aware Parameter-Efficient Fine-Tuning for Large Vision-Language Assistants
Post-training feature pruning for fundus images classification
DSFlash: Comprehensive Panoptic Scene Graph Generation in Realtime
Adaptive Learned Image Compression with Graph Neural Networks
Learning Surgical Robotic Manipulation with 3D Spatial Priors
See It, Say It, Sorted: An Iterative Training-Free Framework for Visually-Grounded Multimodal Reasoning in LVLMs
Synthetic Curriculum Reinforces Compositional Text-to-Image Generation
SkySense-VITA: Towards Universal In-context Segmentation of Multi-modal Remote Sensing Imagery
Where Culture Fades: Revealing the Cultural Gap in Text-to-Image Generation
Beyond Tie Points: Satellite Image Block Adjustment based on Dense Feature Consistency
Anti-Degradation Lifelong Multi-View Clustering
Pressure2Motion: Hierarchical Human Motion Reconstruction from Ground Pressure with Text Guidance
Learning to Adapt: Self-Improving Web Agent via Cognitive-Aware Exploration
WiseEdit: Benchmarking Cognition- and Creativity-Informed Image Editing
WEAVE: Unleashing and Benchmarking the In-context Interleaved Comprehension and Generation
Visual Document Understanding and Reasoning: A Multi-Agent Collaboration Framework with Agent-Wise Adaptive Test-Time Scaling
VisMem: Latent Vision Memory Unlocks Potential of Vision-Language Models
Taming Video Models for 3D and 4D Generation via Zero-Shot Camera Control
Probing and Bridging Geometry–Interaction Cues for Affordance Reasoning in Vision Foundation Models
Fast3Dcache: Training-free 3D Geometry Synthesis Acceleration
FlowDirector: Training-Free Flow Steering for Precise Text-to-Video Editing
Test-Time Attention Purification for Backdoored Large Vision Language Models
Ref4D-VideoBench: Four-Dimensional Reference-Based Evaluation of Text-to-Video Generative Models
PR Depth: Pose-Refined Monocular Depth Estimation with Temporal Consistency
FedAlign: Differentially Private Distribution Alignment for Non-IID Federated Learning
Flow4DGS-SLAM: Optical Flow-Guided 4D Gaussian Splatting SLAM
Functional Mean Flow in Hilbert Space
EmbodiedSplat: Online Feed-Forward Semantic 3DGS for Open-Vocabulary 3D Scene Understanding
Moving Border Ownership for Event-based Motion Segmentation
Learning Cross-View Object Correspondence via Cycle-Consistent Mask Prediction
Multi-Crit: Benchmarking Multimodal Judges on Pluralistic Criteria-Following
LUMINA: A Multi-Vendor Mammography Benchmark with Energy Harmonization Protocol
Efficient Hybrid SE(3)-Equivariant Visuomotor Flow Policy via Spherical Harmonics for Robot Manipulation
DrivePTS: A Progressive Learning Framework with Textual and Structural Enhancement for Driving Scene Generation
ORCA: Orchestrated Reasoning with Collaborative Agents for Document Visual Question Answering
Bidirectional Normalizing Flow: From Data to Noise and Back
Rethinking Camera Choice: An Empirical Study on Fisheye Camera Properties in Robotic Manipulation
Stake the Points: Structure-Faithful Instance Unlearning
URScenes: A Multi-scenario Dataset for Unstructured Road Environments
GazeShift: Unsupervised Gaze Estimation and Dataset for VR
Node-RF: Learning Generalized Continuous Space-Time Scene Dynamics with Neural ODE-based NeRFs
SFR-Net: Steering-Fusion-Refining Network in Multi-label Zero-Shot Sewer Defect Detection
Tea-Adapter: Teacher Adapter for Efficient Conditional Generation
Keep it SymPL: Symbolic Projective Layout for Allocentric Spatial Reasoning in Vision-Language Models
ReMatch: Boosting Representation through Matching for Multimodal Retrieval
CLP: A Real-World Dataset of Contaminated Lens Protectors for Robust Semantic Segmentation
Efficient and High-Fidelity Omni Modality Retrieval
Scene Grounding in the Wild
Bidirectional Multimodal Prompt Learning with Scale-Aware Training for Few-Shot Multi-Class Anomaly Detection
Uncertainty-Aware Knowledge Distillation for Multimodal Large Language Models
Memory Matters: Boosting Training-Free Zero-Shot Temporal Action Localization with a Learnable Lookup Table
WISER: Wider Search, Deeper Thinking, and Adaptive Fusion for Training-Free Zero-Shot Composed Image Retrieval
Harmony: Harmonizing Audio and Video Generation through Cross-Task Synergy
HTC-VLM: Disentangled Hybrid Token Compression for Vision-Language Models
PoseAnything: Universal Pose-guided Video Generation with Part-aware Temporal Coherence
BHCast: Unlocking Black Hole Plasma Dynamics from a Single Blurry Image with Long-Term Forecasting
CycleManip: Enabling Cycle-based Manipulation via Effective History Perception and Understanding
UAV-CB: A Complex-Background RGB–T Dataset and Local Frequency Bridge Network for UAV Detection
eRetinexGS: Retinex Modeling for Low-Light Scene Enhancement via Event Streams and 3D Gaussian Splatting
Harmonized Feature Conditioning and Frequency-Prompt Personalization for Multi-Rater Medical Segmentation
Polyphony: Diffusion-based Dual-Hand Action Segmentation with Alternating Vision Transformer and Semantic Conditioning
Seeing Through the Noise: Improving Infrared Small Target Detection and Segmentation from Noise Suppression Perspective
MVLM: Template-Free Tracking via Vision–Language Margin Confidence and Memory-Gated Tracking
Twin-T & TwintVQA: A Reliable Structure–Detail Separating VLM and a Comprehensive Benchmark for Chart and Table Tasks
OneOcc: Semantic Occupancy Prediction for Legged Robots with a Single Panoramic Camera
EgoRoC: Towards Egocentric Robotic Control via Task-Agnostic Visual Alignment
SegEarth-R2: Towards Comprehensive Language-guided Segmentation for Remote Sensing Images
FastLightGen: Fast and Light Video Generation with Fewer Steps and Parameters
AnyPcc: Compressing Any Point Cloud with a Single Universal Model
Revisiting Sparsity Constraint Under High-Rank Property in Partial Multi-Label Learning
Robust3DGSW: Toward Robust Watermarking for Quantization-Aware 3D Gaussian Splatting
Neuro-Cognitive Reward Modeling for Human-Centered Autonomous Vehicle Control
Stay in your Lane: Role Specific Queries with Overlap Suppression Loss for Dense Video Captioning
SGI: Structured 2D Gaussians for Efficient and Compact Large Image Representation
VGGT-360: Geometry-Consistent Zero-Shot Panoramic Depth Estimation
EE-RL: Vision Language Guided Reinforcement Learning with Explorer and Expert model for End-to-End Autonomous Driving
CREval: An Automated Interpretable Evaluation for Creative Image Manipulation under Complex Instructions
The Golden Subspace: Where Efficiency Meets Generalization in Continual Test-Time Adaptation
Physics-Consistent Diffusion for Efficient Fluid Super-Resolution via Multiscale Residual Correction
SafeLogo: Turning Your Logos into Jailbreak Shields via Micro-Regional Adversarial Training
Parameter-efficient Continual Learning for Enhancing Plasticity without Forgetting under Limited Model Capacity
VoDaSuRe: A Large-Scale Dataset Revealing Domain Shift in Volumetric Super-Resolution
Specificity-aware reinforcement learning for fine-grained open-world classification
AdapAction: Adaptive Target Action Backdoor Attack against GUI Agents
Neural-Centric Video Processing Pipeline for Unified Multi-Task Inference
Harnessing the Power of Foundation Models for Accurate Material Classification
From Measurement to Mitigation: Quantifying and Reducing Identity Leakage in Image Representation Encoders with Linear Subspace Removal
Any2Any 3D Diffusion Models with Knowledge Transfer: A Radiotherapy Planning Study
M4Human: A Large-Scale Multimodal mmWave Radar Benchmark for Human Mesh Reconstruction
BiomedCCPL: Causal Conditional Prompt Learning for Biomedical Vision-Language Models
Think-as-You-See: Streaming Chain-of-Thought Reasoning for Large Vision-Language Models
Mechanisms of Object Localization in Vision–Language Models
Machine Mental Imagery: Empower Multimodal Reasoning with Latent Visual Tokens
Spatiotemporal Pyramid Flow Matching for Climate Emulation
Cut to the Chase: Training-free Multimodal Summarization via Chain-of-Events
Physics-Guided Multistep Deformation Reversal for Ancient Bamboo Slip Restoration
TAP: A Token-Adaptive Predictor Framework for Training-Free Diffusion Acceleration
From Contrast to Consistency: Rethinking Event-based Continuous-Time Optical Flow Estimation
Rethinking SNN Online Training and Deployment: Gradient-Coherent Learning via Hybrid-Driven LIF Model
IMU-HOI: A Symbiotic Framework for Coherent Human-Object Interaction and Motion Capture via Contact-Conscious Inertial Fusion
SpaceMind: Camera-Guided Modality Fusion for Spatial Reasoning in Vision-Language Models
FisherPoser: Human Motion Estimation from Sparse Observations with Hierarchical Region-Wise Fisher-Matrix Uncertainty Modeling
RC-NF: Robot-Conditioned Normalizing Flow for Real-Time Anomaly Detection in Robotic Manipulation
COPE: Consistent Occlusion and Prompt Enhancement Network for Occluded Person Re-identification
Decompose and Transfer: CoT-Prompting Enhanced Alignment for Open-Vocabulary Temporal Action Detection
LiDeRe: A Lightweight Readout for Fast and Data-Efficient Dense Prediction
Stand-In: A Lightweight and Plug-and-Play Identity Control for Video Generation
iMontage: Unified, Versatile, Highly Dynamic Many-to-many Image Generation
Adapting In-context Generation for Enhanced Composed Image Retrieval
MMLandmarks: a Cross-View Instance-Level Benchmark for Geo-Spatial Understanding
FAST: Topology-Aware Frequency-Domain Distribution Matching for Coreset Selection
Taming the Long Tail: Rebalancing Adversarial Training via Adaptive Perturbation
HUMORCHAIN: Theory-Guided Multi-Stage Reasoning for Interpretable Multimodal Humor Generation
MR. Illuminate: Zero-Shot Low-Light Image Enhancement with Diffusion Prior
Bypassing the Transport Plan: Dynamic Reweighting for Out-of-Distribution Detection with Optimal Transport
ChartR: Evaluating Reasoning Accuracy and Robustness in Chart Question Answering
VISion On Request: Enhanced VLLM efficiency with sparse, dynamically selected, vision-language interactions
EMO-R3: Reflective Reinforcement Learning for Emotional Reasoning in Multimodal Large Language Models
SocialNav: Training Human-Inspired Foundation Model for Socially-Aware Embodied Navigation
MSJoE: Jointly Evolving MLLM and Sampler for Efficient Long-Form Video Understanding
REVISOR: Beyond Textual Reflection, Towards Multimodal Introspective Reasoning in Long-Form Video Understanding
Similarity-as-Evidence: Calibrating Overconfident VLMs for Interpretable and Label-Efficient Medical Active Learning
StoryTailor: A Zero-Shot Pipeline for Action-Rich Multi-Subject Visual Narratives
GHPT: Real-Time Relightable Gaussian Splatting using Hybrid Path Tracing
Same Attention, Different Truths: Put Logit-Lens over Visual Attention to Detect and Mitigate LVLM Object Hallucination
SceMoS: Local Scene-Aware Human Motion Synthesis by Planning with Geometry-Grounded Tokens
FluxMem: Adaptive Hierarchical Memory for Streaming Video Understanding
The Drift Kernel: Why Diffusion Models Change Even When Told Not To
MDS-VQA: Model-Informed Data Selection for Video Quality Assessment
Hierarchical Point-Patch Fusion with Adaptive Patch Codebook for 3D Shape Anomaly Detection
Seeing Beyond 8bits: Subjective and Objective Quality Assessment of HDR-UGC Videos
CLIP Is Shortsighted: Paying Attention Beyond the First Sentence
BeautyGRPO: Aesthetic Alignment for Face Retouching via Dynamic Path Guidance and Fine-Grained Preference Modeling
HeSS: Head Sensitivity Score for Sparsity Redistribution in VGGT
SkeletonContext: Skeleton-side Context Prompt Learning for Zero-Shot Skeleton-based Action Recognition
Reviving ConvNeXt for Efficient Convolutional Diffusion Models
U4D: Uncertainty-Aware 4D World Modeling from LiDAR Sequences
MICON-Bench: Benchmarking and Enhancing Multi-Image Context Image Generation in Unified Multimodal Models
OmniVGGT: Omni-Modality Driven Visual Geometry Grounded Transformer
VDFE: Difference-Aware 3D Scene Editing with Non-Intrusive Video Diffusion Priors for Multi-View Consistency and Efficiency
TruckDrive: Long-Range Autonomous Highway Driving Dataset
DF$^2$-VB: Dual-level Fuzzy Fusion with View-specific Boosting for Multi-view Multi-label Classification
Tell2Adapt: A Unified Framework for Source Free Unsupervised Domain Adaptation via Vision Foundation Model
NeoVerse: Enhancing 4D World Model with in-the-wild Monocular Videos
Driving on Registers
Proxy3D: Efficient 3D Representations for Vision-Language Models via Semantic Clustering and Alignment
Towards Stable Federated Continual Test-Time Adaptation in Wild World
StyleTextGen: Style-Conditioned Multilingual Scene Text Generation
Heuristic Self-Paced Learning for Domain Adaptive Semantic Segmentation under Adverse Conditions
DRS-GUI: Dynamic Region Search for Training-Free GUI Grounding
HierEdit: Region-Aware Hierarchical Diffusion for Efficient High-Resolution Editing
Text-Printed Image: Bridging the Image-Text Modality Gap for Text-centric Training of Large Vision-Language Models
Boundary-Responsive Differentiable Gating for Superpixel-Based Segmentation
Seeing Through Touch: Tactile-Driven Visual Localization of Material Regions
Unified Generation and Self-Verification for Vision-Language Models via Advantage Decoupled Preference Optimization
Through the Frequency Lens: Cross-Domain Generalisable Gaze Estimation with Adaptive Modulation
AutoDebias: An Automated Framework for Detecting and Mitigating Backdoor Biases in Text-to-Image Models
CARD: A Multi-Modal Automotive Dataset for Dense 3D Reconstruction in Challenging Road Topography
Charge: A Comprehensive Benchmark and Dataset for Dynamic Novel View Synthesis
Off The Grid: Detection of Primitives for Feed-Forward 3D Gaussian Splatting
RAM: Recover Any 3D Human Motion in-the-Wild
TAMER: A Tri-Modal Contrastive Alignment and Multi-Scale Embedding Refinement Framework for Zero-Shot ECG Diagnosis
DynBridge: Bridging Imagination and Control through Interaction Dynamics for Robot Manipulation
Towards Policy-Adaptive Image Guardrail: Benchmark and Method
MultiShotMaster: A Controllable Multi-Shot Video Generation Framework
Revisiting Optimal Coding for I-ToF under Practical Sensor Constraints
Mitigating The Distribution Shift of Diffusion-based Dataset Distillation
Prototype-Guided Concept Erasure in Diffusion Models
Blink: Dynamic Visual Token Resolution for Enhanced Multimodal Understanding
UnityVideo: Unified Multi-Modal Multi-Task Learning for Enhancing World-Aware Video Generation
EvoGraph-R1: Self-Evolving Multimodal Knowledge Hypergraphs for Agentic Retrieval
Do VLMs Perceive or Recall? Probing Visual Perception vs. Memory with Classic Visual Illusions
Robust Promptable Video Object Segmentation
FluidGaussian: Propagating Simulation-Based Uncertainty Toward Functionally-Intelligent 3D Reconstruction
Rewis3d: Reconstruction for Weakly-Supervised Semantic Segmentation
VAST: Video Ability-Stratified Taxonomy for Data-Efficient Video Reasoning
Harmonious Parameter Adaptation in Continual Visual Instruction Tuning for Safety-Aligned MLLMs
Quota-Calibrated Fine-Grained Alignment with Context-Aware Marginals for Text-based Person Retrieval
Residual Diffusion Bridge Model for Image Restoration
DENALI: A Dataset Enabling Non-Line-of-Sight Spatial Reasoning with Low-Cost LiDARs
LaDy: Lagrangian-Dynamic Informed Network for Skeleton-based Action Segmentation via Spatial-Temporal Modulation
Spectral Scalpel: Amplifying Adjacent Action Discrepancy via Frequency-Selective Filtering for Skeleton-Based Action Segmentation
CURVE: A Benchmark for Cultural and Multilingual Long Video Reasoning
DDSF: Robust Few-Shot Learning via Disentangled Subspaces with Determinantal Point Process
AeroGS: Scale-Aware Gaussian Splatting for Pose-Free Dynamic UAV Scene Reconstruction
ConeSep: Cone-based Robust Noise-Unlearning Compositional Network for Composed Image Retrieval
Air-Know: Arbiter-Calibrated Knowledge-Internalizing Robust Network for Composed Image Retrieval
Explaining CLIP Zero-shot Predictions Through Concepts
BrepGaussian: CAD reconstruction from Multi-View Images with Gaussian Splatting
Beyond Graph Model: Reliable VLM Fine-Tuning via Random Graph Adapter
EasyOmnimatte: Taming Pretrained Inpainting Diffusion Models for End-to-End Video Layered Decomposition
GeoBridge: A Semantic-Anchored Multi-View Foundation Model Bridging Images and Text for Geo-Localization
Med-CMR: A Fine-Grained Benchmark Integrating Visual Evidence and Clinical Logic for Medical Complex Multimodal Reasoning
PhysInOne: Visual Physics Learning and Reasoning in One Suite
Zero-Shot Feature Upsampling via Neighborhood Attention Filtering
UniDef: Universal Defense Against Unauthorized Image Manipulation
EgoMind: Activating Spatial Cognition through Linguistic Reasoning in MLLMs
GeoPredict: Leveraging Predictive Kinematics and 3D Gaussian Geometry for Precise VLA Manipulation
Confusion-Aware Spectral Regularizer for Long-Tailed Recognition
SynCLIP: Synonym-Coherent Language-Image Pretraining for Robust Open-Vocabulary Dense Perception
ASFormer: Adaptive Serialized Transformers for Monocular Semantic Scene Completion from Indoor Environments
ReMoRa: Multimodal Large Language Model based on Refined Motion Representation for Long-Video Understanding
OneSparse: A Unified Framework for Sparse Activation Layers in Vision Models
UniChange: Unifying Change Detection with Multimodal Large Language Model
Scene-Centric Unsupervised Video Panoptic Segmentation
Failure Modes for Deep Learning–Based Online Mapping: How to Measure and Address Them
Mitigating Instance Entanglement in Instance-Dependent Partial Label Learning
Bias In, Bias Out? Finding Unbiased Subnetworks in Vanilla Models
IAFMNet: Information-Aware Feature Modulation for Efficient Super-Resolution
LVLM-Aided Alignment of Task-Specific Vision Models
Re-Align: Structured Reasoning-guided Alignment for In-Context Image Generation and Editing
Benchmarking Unified Any-to-Any Interleaved Multimodal Learning
Meta-CoT: Enhancing Granularity and Generalization in Image Editing
SpikeTrack: High-performance and Energy-efficient Event-Based Object Tracking with Spiking Neural Network
V2U4Real: A Real-world Large-scale Dataset for Vehicle-to-UAV Cooperative Perception
TGTrack: Temporal Generative Learning for Unified Single Object Tracking
Streamlined Knowledge Distillation
UniDex: A Robot Foundation Suite for Universal Dexterous Hand Control from Egocentric Human Videos
Rethinking BCE Loss for Multi-Label Image Recognition with Fine-tuning
InsCal: Calibrated Multi-Source Fully Test-Time Prompt Tuning for Object Detection
The Blind Spot of Adaptation: Quantifying and Mitigating Forgetting in Fine-tuned Driving Models
The Road Less Seen: Segment Exploration for Weakly Supervised Video Anomaly Detection
FiDeSR: High-Fidelity and Detail-Preserving One-Step Diffusion Super-Resolution
A Causal Marriage between VLM and IRM from Understanding to Reasoning
Order Matters: 3D Shape Generation from Sequential VR Sketches
UNI-OOD: Unified Object- and Image-level Out-of-Distribution Detection via Cross-Context Attentive Vision-Language Modeling
TiViBench: Benchmarking Think-in-Video Reasoning for Video Generation
Training-Only Heterogeneous Image-Patch-Text Graph Supervision for Advancing Few-Shot Learning Adapters
TAPFormer: Robust Arbitrary Point Tracking via Transient Asynchronous Fusion of Frames and Events
SyncDreamer: Controllable and Expressive Avatar Generation Beyond the Talking Head
PROMPTMINER: Black-Box Prompt Stealing against Text-to-Image Generative Models via Reinforcement Learning and Fuzz Optimization
From Selection to Scheduling: Federated Geometry-Aware Correction Makes Exemplar Replay Work Better under Continual Dynamic Heterogeneity
CICA: Coupling Confidence-Aware Pretraining with Confidence-Informed Attention for Robust Multimodal Sentiment Analysis
Seeing Through the Shift: Causality-Inspired Robust Generalized Category Discovery
YOLO-ULM: Ultra-Lightweight Models for Real-Time Object Detection
Plug-and-Play Incomplete Multi-View Clustering via Janus-Faced Affinity Learning with Topology Harmonization
Tokenization Allows Multimodal Large Language Models to Understand, Generate and Edit Architectural Floor Plans
Diffusion-Based sRGB Real Noise Generation via Prompt-Driven Noise Representation Learning
AutoCut: End-to-end advertisement video editing based on multimodal discretization and controllable generation
ProactiveMobile: A Comprehensive Benchmark for Boosting Proactive Intelligence On Mobile Devices
Consensus Entropy: Harnessing Multi-VLM Agreement for Self-Verifying and Self-Improving OCR
PoseGaussian: 6D Pose Estimation for Unseen Objects via Sparse-View Object-Level 3D Gaussian Splatting
LiREC-Net: A Target-Free and Learning-Based Network for LiDAR, RGB, and Event Calibration
Depth Any Panoramas: A Foundation Model for Panoramic Depth Estimation
3D-Aware Implicit Motion Control for View-Adaptive Human Video Generation
OneThinker: All-in-one Reasoning Model for Image and Video
DICArt: Advancing Category-level Articulated Object Pose Estimation in Discrete State-Spaces
When to Think and When to Look: Uncertainty-Guided Lookback
Language-driven Fine-grained Retrieval
SRPO: Self-Referential Policy Optimization for Vision-Language-Action Models
Just-in-Time: Tuning-Free Spatial Acceleration for Diffusion Transformers
ChordEdit: One-Step Low-Energy Transport for Image Editing
All-in-One Slider for Attribute Manipulation in Diffusion Models
Toward Real-world Infrared Image Super-Resolution: A Unified Autoregressive Framework and Benchmark Dataset
DualSplat: Robust 3D Gaussian Splatting via Pseudo-Mask Bootstrapping from Reconstruction Failures
Is your VLM Sky-Ready? A Comprehensive Spatial Intelligence Benchmark for UAV Navigation
Echoes Over Time: Unlocking Length Generalization in Video-to-Audio Generation Models
UniFusion: A Unified Image Fusion Framework with Robust Representation and Source-Aware Preservation
PRISM: Learning a Shared Primitive Space for Transferable Skeleton Action Representation
Saliency-R1: Enforcing Interpretable and Faithful Vision-language Reasoning via Saliency-map Alignment Reward
Learnability-Guided Diffusion for Dataset Distillation
VRR-QA: Visual Relational Reasoning in Videos Beyond Explicit Cues
Faster-GS: Analyzing and Improving Gaussian Splatting Optimization
NVGS: Neural Visibility for Occlusion Culling in 3D Gaussian Splatting
PromptDepth: Efficient and Promptable Geometric 3D Vision Model for Embodied Intelligence
4DWorldBench: A Comprehensive Evaluation Framework for 3D/4D World Generation Models
RankOOD - Class Ranking-based Out-of-Distribution Detection
Roots Beneath the Cut: Uncovering the Risk of Concept Recovery in Pruning-Based Unlearning for Diffusion Models
H$^{2}$A$^{2}$: Homogeneity-Aware and Heterogeneity-Aware Feature Perception for Unified Indoor 3D Object Detection
Towards Sparse Video Understanding and Reasoning
TWEO: Transformers Without Extreme Outliers Enables FP8 Training And Quantization For Dummies
SketchVL: Policy Optimization via Fine-Grained Credit Assignment for Chart Understanding and More
VSRELL: A Simple Baseline for Video Super-Resolution and Enhancement in Low-Light Environment
Few-Step Diffusion Sampling Through Instance-Aware Discretizations
Why Does RL Generalize Better Than SFT? A Data-Centric Perspective on VLM Post-Training
Improving Diffusion Generalization with Weak-to-Strong Segmented Guidance
ViLearn: Accelerating Training Convergence of Image-to-3D Generation via Visibility Learning
Think Before You Drive: World Model-Inspired Multimodal Grounding
Towards Calibrating Prompt Tuning of Vision-Language Models
4C4D: 4 Camera 4D Gaussian Splatting
Differentiable Laplacian Matrix Guided Superpixel Segmentation
LEAD: Minimizing Learner-Expert Asymmetry in End-to-End Driving
GeniNav: Generative Model Driven Image-Goal Navigation via Imagination-Guided Consistency Flow Matching
Reasoning Diffusion for Unpaired Text-Image to Video Generation
CORE: Compact Object-centric REpresentations as a New Paradigm for Token Merging in LVLMs
GlyphPrinter: Region-Grouped Direct Preference Optimization for Glyph-Accurate Visual Text Rendering
cryoSENSE: Compressive Sensing Enables High-throughput Microscopy with Sparse and Generative Priors on the Protein Cryo-EM Image Manifold
RADAR: VQ-VAE decoder of VAR is a good student for Restoring Against Degradation by Acceleration
FaithFusion: Harmonizing Reconstruction and Generation via Pixel-wise Information Gain
MicroFM: Physics-guided Flow Matching for Isotropic Microscopy Reconstruction
Nonlinear Color Transfer via Learnable Bezier Flows
Plant Taxonomy Meets Plant Counting: A Fine-Grained, Taxonomic Dataset for Counting Hundreds of Plant Species
Spectrally Distilled Representations Aligned with Instruction-Augmented LLMs for Satellite Imagery
UFVideo: Towards Unified Fine-Grained Video Cooperative Understanding with Large Language Models
AVGGT: Rethinking Global Attention for Accelerating VGGT
Dual-Prototype-Guided Multi-task Learning for Unsupervised Anomaly Detection and Classification
CARE What Fails: Contrastive Anchored-REflection for Verifiable Multimodal Reasoning
Learned Image Compression via Sparse Attention and Adaptive Frequency
Structural–Semantic Perception for Diffusion-Guided Temporal Forgery Localization
Structure-to-Intensity Diffusion for Adverse-Weather LiDAR Generation
HyperGaussians: High-Dimensional Gaussian Splatting for High-Fidelity Animatable Face Avatars
Masked Region Transformer for Layered Image Generation and Editing at Scale
H-Sets: Hessian-Guided Discovery of Set-Level Feature Interactions in Image Classifiers
AGiLe: Learning Robust Long-Horizon Manipulation via Affordance-Grounded Bidirectional Latent Planning
MS^2Gait: A Multi-Scale Spatio-Temporal Fusion Network for LiDAR-based Gait Recognition
MotionHiFlow: Text-to-Motion via Hierarchical Flow Matching
UVU: Improving Multimodal Understanding via Vision-Language Unified Autoregressive Paradigm
HiLoRA: Hierarchical Low-Rank Adaptation for Personalized Federated Learning
Reparameterized Tensor Ring Functional Decomposition for Multi-Dimensional Data Recovery
MV-Fashion: Towards Enabling Virtual Try-On and Size Estimation with Multi-View Paired Data
Content-Aware Frequency Encoding for Implicit Neural Representations with Fourier-Chebyshev Features
Rounded or Streamlined Head? Bridging Concept Bottleneck Models and Attribute-Described Object Parts
GeoDiT: A Diffusion-based Vision-Language Model for Geospatial Understanding
Attribution-Guided Model Rectification of Unreliable Neural Network Behaviors
Dynamic Stream Network for Combinatorial Explosion Problem in Deformable Medical Image Registration
SpikeTrack: A Spike-driven Framework for Efficient Visual Tracking
PALM: Progress-Aware Policy Learning via Affordance Reasoning for Long-Horizon Robotic Manipulation
Fractal Camouflage: A Bio-Inspired Approach for Multi-Scale Adversarial Attacks in the Infrared Domain
OASIS: On-Demand Hierarchical Event Memory for Streaming Video Reasoning
Pixels Don't Lie (But Your Detector Might): Bootstrapping MLLM-as-a-Judge for Trustworthy Deepfake Detection and Reasoning Supervision
Part$^{2}$GS: Part-aware Modeling of Articulated Objects using 3D Gaussian Splatting
PHANTOM: Physics-Infused Video Generation via Joint Modeling of Visual and Latent Physical Dynamics
CASR: A Robust Cyclic Framework for Arbitrary Large-Scale Super-Resolution with Distribution Alignment and Self-Similarity Awareness
TRCoRSurg: Temporal-Relational Co-Reasoning for Surgical Video Triplet Recognition
CoRoGS: Contextual Gaussian Splatting for Robust Large-Deviation View Synthesis
MSAG: A Multispectral Aerial–Ground Benchmark for Any-Scenario Person Re-Identification
Efficient Equivariant Transformer for Self-Driving Agent Modeling
Learning to Refuse: Refusal-Aware Reinforcement Fine-Tuning for Hard-Irrelevant Queries in Video Temporal Grounding
History to Future: Evolving Agent with Experience and Thought for Zero-shot Vision-and-Language Navigation
D3D-VLP: Dynamic 3D Vision-Language-Planning Model for Embodied Grounding and Navigation
Computation and Communication Efficient Federated Unlearning via On-server Gradient Conflict Mitigation and Expression
Factorized Context Aggregation for Robust Cancer Risk Estimation via Soft Re-Ranked Retrieval and Hierarchical Anchors
ZOO-Prune: Training-Free Token Pruning via Zeroth-Order Gradient Estimation in Vision-Language Models
3DReflecNet: A Large-Scale Dataset for 3D Reconstruction of Reflective, Transparent, and Low-Texture Objects
PLACID: Identity-Preserving Multi-Object Compositing via Video Diffusion with Synthetic Trajectories
MICo-150K: A Comprehensive Dataset Advancing Multi-Image Composition
Real-Time Dynamic Scene Rendering with Controlled Compressibility and Contact Awareness
M4-RAG: A Massive-Scale Multilingual Multi-Cultural Multimodal RAG
Proof-of-Perception: Certified Tool-Using Multimodal Reasoning with Compositional Conformal Guarantees
What’s Wrong with Synthetic Data for Scene Text Recognition? A Strong Synthetic Engine with Diverse Simulations and Self-Evolution
Adaptive 3D Perception Under Sparse Sampling via Reinforcement Learning
RFDM: Residual Flow Diffusion Models for Video Editing
EgoEdit: Dataset, Real-Time Streaming Model, and Benchmark for Egocentric Video Editing
Tell Model Where to Look: Mitigating Hallucinations in MLLMs by Vision-Guided Attention
THE MORE, THE MERRIER: CONTRASTIVE FUSION FOR HIGHER-ORDER MULTIMODAL ALIGNMENT
Object-Generalized Re-Identification: A Step Towards Universal Instance Perception
DSO: Direct Steering Optimization for Bias Mitigation
RDFace: A Benchmark Dataset for Rare Disease Facial Image Analysis under Extreme Data Scarcity and Phenotype-Aware Synthetic Generation
OLATverse: A Large-scale Real-world Object Dataset with Precise Lighting Control
Learning to Solve PDEs on Neural Shape Representations
SynthRGB-T: Language-Vision Guided Image Translation for Diversity Synthesis
NESTOR: A Nested MOE-based Neural Operator for Large-Scale PDE Pre-Training
Scene Reconstruction as Mapping Priors for 3D Detection
Dedelayed: Deleting remote inference delay via on-device correction
Interpretable Debiasing of Vision-Language Models for Social Fairness
SO-Bench: A Structural Output Evaluation of Multimodal LLM
SimRecon: SimReady Compositional Scene Reconstruction from Real Videos
Outlier-Robust Diffusion Solvers for Inverse Problems
240FPS Stereo Vision from Monocular Mixed Spikes
BiMotion: B-spline Motion for Text-guided Dynamic 3D Character Generation
FedHarmony: Harmonizing Heterogeneous Label Correlations in Federated Multi-Label Learning
Language Does Matter for Cross-Domain Few-Shot Visual Feature Enhancement
Understanding Counting Mechanisms in Large Language and Vision-Language Models
Ego2Web: A Web Agent Benchmark Grounded on Egocentric Videos
AMoE: Agglomerative Mixture-of-Experts Vision Foundation Model
Gradient Knows Best: Mixed-Precision Quantization via Gradient-Guided Bit Allocation for Super-Resolution
VisRes Bench: On Evaluating the Visual Reasoning Capabilities of VLMs
STAR: Test-Time Adaptation Can Enhance Universal Prompt Learning for Vision-Language Models
Learning to Diversify and Focus: A Reinforcement Framework for Open-Vocabulary HOI Detection
3M-TI: High-Quality Mobile Thermal Imaging via Calibration-free Multi-Camera Cross-Modal Diffusion
Negative Binomial Variational Autoencoders for Overdispersed Latent Modeling
Towards Human-Imperceptible Backdoor Attacks on Text-to-Image Diffusion Models
REVIVE 3D: Refinement via Encoded Voluminous Inflated prior for Volume Enhancement
HiFi-Inpaint: Towards High-Fidelity Reference-Based Inpainting for Generating Detail-Preserving Human-Product Images
One-to-All Animation: Alignment-Free Character Animation and Image Pose Transfer
RNED: Rotary Number Encoding and Decoding for Quantitative Medical VLM Analysis
Exploring Adaptive Masked Reconstruction for Self-Supervised Skeleton-Based Action Recognition
Anomaly as Non-Conformity via Training-Free Graph Laplacian Energy Minimization
AeroAgent: A Vision–Physics–Decision Framework for Aerodynamic Vehicle Design
LASAR: Towards Spatio-temporal Reasoning with Latent Cognitive Map
Temporal Representation Enhancement (TRE): Learning to Forget Dominant Patterns for More Discriminative Spiking Features
SRA-Det: Learning Omni-Grained Open-Vocabulary Detection Beyond Category Names
Expanding mmWave Datasets for Human Pose Estimation with Unlabeled Data and LiDAR Datasets
InnoAds-Composer: Efficient Condition Composition for E-Commerce Poster Generation
EVLF: Early Vision-Language Fusion for Generative Dataset Distillation
Orthogonal Spatial-Aware Multi-View Anchor Graph Clustering for Incomplete Remote Sensing Data
ResAD: Normalized Residual Trajectory Modeling for End-to-End Autonomous Driving
Neighbor-Aware Localized Concept Erasure in Text-to-Image Diffusion Models
Neural Mixture Density Processes
CREward: A Type-Specific Creativity Reward Model
Question-guided Visual Compression with Memory Feedback for Long-Term Video Understanding
Hyper-PCN: Hypergraph-based Point Cloud Completion via High-order Correlation Modeling
SAIDO: Generalizable Detection of AI-Generated Images via Scene-Aware and Importance-Guided Dynamic Optimization in Continual Learning
Single-step Diffusion-based Video Coding with Semantic-Temporal Guidance
Semi-Supervised Conformal Prediction With Unlabeled Nonconformity Score
CoD: A Diffusion Foundation Model for Image Compression
W2W: Language-Model-Based Trajectory Prediction with Reinforcement Learning
A Temporal and Content Co-Awareness Latent Diffusion for Controllable Hand Image Generation
Generative Video Compression with One-Dimensional Latent Representation
Refining Few-Step Text-to-Multiview Diffusion via Reinforcement Learning
Occluded Human Body Capture with Frequency Domain Denoising Prior
FlowDC: Flow-Based Decoupling-Decay for Complex Image Editing
TF-SSD: A Strong Pipeline via Synergic Mask Filter for Training-free Co-salient Object Detection
SAM2Text: Towards Prompt-Free and Multi-Resolution Video Scene Text Segmentation
GeoSANE: Learning Geospatial Representations From Models, Not Data
SCE-SLAM: Scale-Consistent Monocular SLAM via Scene Coordinate Embeddings
CoLoR: The Devil is in Scene Coordinate Regression for Large-Scale Visual Localization
More Than Meets the Eye: A Unified Image Fusion Framework via Semantic-Pixel Entropy Trade-off for Zero-Shot Generalization
SCoRe: Salience-Coverage Reduction for Vision Token Pruning in Vision-Language Models
From Observation to Action: Latent Action-based Primitive Segmentation for VLA Pre-training in Industrial Settings
SECOS: Semantic Capture for Rigorous Classification in Open-World Semi-Supervised Learning
Decision Boundary-aware Generation for Long-tailed Learning
When AVSR Meets Video Conferencing: Dataset, Degradation, and the Hidden Mechanism Behind Performance Collapse
CUE: Concept-Aware Multi-Label Expansion to Mitigate Concept Confusion in Long-Tailed Learning
Breaking the 3D Dataset Bottleneck: Fast Scalable Generation of Aligned 3D Assets from Scratch for Category 6D Pose Estimation and Robotic Grasping
Reallocating Attention Across Layers to Reduce Multimodal Hallucination
UFO: Unifying Feed-Forward and Optimization-based Methods for Large Driving Scene Modeling
MeanFuser: Fast One-Step Multi-Modal Trajectory Generation and Adaptive Reconstruction via MeanFlow for End-to-End Driving
From Pairs to Sequences: Track-Aware Policy Gradients for Keypoint Detection
NeAR: Coupled Neural Asset–Renderer Stack
Towards Storytelling Animations: Joint Synthesis of Human and Camera Motions
Mamba Learns in Context: Structure-Aware Domain Generalization for Multi-Task Point Cloud Understanding
CMR-RD: Long-Tailed Adaptive VLM for Explainable CMR Diagnosis
DriveLaW: Unifying Planning and Video Generation in a Latent Driving World
DGGT: Feedforward 4D Reconstruction of Dynamic Driving Scenes using Unposed Images
Self-Diffusion Driven Blind Imaging
ParkGaussian: Surround-view 3D Gaussian Splatting for Autonomous Parking
U$^{2}$Flow: Uncertainty-Aware Unsupervised Optical Flow Estimation
Learning Latent Concepts for Detecting Out-of-Distribution Objects
Back to Point: Exploring Point-Language Models for Zero-Shot 3D Anomaly Detection
Affordance-First Decomposition for Continual Learning in Video–Language Understanding
Contact-Aware Neural Dynamics
MAMMA: Markerless Accurate Multi-person Motion Acquisition
Beyond Duality: A Hybrid Framework of Leveraging Shared and Private Features for RGB-Event Object Detection
Addressing Exacerbated Attention Sink for Source-Free Cross-Domain Few-Shot Learning
NG-GS: NeRF-guided 3D Gaussian Splatting Segmentation
CoLC: Communication-Efficient Collaborative Perception with LiDAR Completion
Beyond Missing Modalities: Hypergraph Conditioned Diffusion for Uncertainty-Aware Multimodal Emotion Recognition
GS^2: Graph-based Spatial Distribution Optimization for Compact 3D Gaussian Splatting
FFP-300K: Scaling First-Frame Propagation for Generalizable Video Editing
Label What Matters: Modality-Balanced and Difficulty-Aware Multimodal Active Learning
Bidirectional Query-Driven Generation of Parametric CAD Sketch
ARC Is a Vision Problem!
Beyond Appearance: Camouflaged Object Detection via Geometric Structure
Say Cheese! Detail-Preserving Portrait Collection Generation via Natural Language Edits
Training-Free Open-Vocabulary Camouflaged Object Segmentation via Fine-Grained Object Binding and Adaptive Hybrid Prompt
EfficientVPR: Toward Efficient Visual Place Recognition via Scene-Aware Prompt Tuning and Adaptive Feature Enhancement
Illustrator’s Depth: Monocular Layer Index Prediction for Image Decomposition
PGR-Net: Prior-Guided ROI Reasoning Network for Brain Tumor MRI Segmentation
Thinking in Dynamics: How Multimodal Large Language Models Perceive, Track, and Reason Dynamics in Physical 4D World
Dense Metric Depth Completion from Sparse Direct Time-of-Flight Sensors
DuetSVG: Unified Multimodal SVG Generation with Internal Visual Guidance
CIGMA: Causal Information-Gain Mechanistic Attribution of Attention Heads in Vision Transformers
Selective Amnesia using Contrastive Subnet Erasure for Class Level Unlearning in Vision Models
SceneMaker: Open-set 3D Scene Generation with Decoupled De-occlusion and Pose Estimation Model
ULF-Loc: Unbiased Landmark Feature for Robust Visual Localization with 3D Gaussian Splatting
Closed-Form Concept Erasure via Double Projections
LNEM: Lunar Neural Elevation Model
Low-Resolution Editing is All You Need for High-Resolution Editing
Generalizable Radio-Frequency Radiance Fields for Spatial Spectrum Synthesis
WeMMU: Enhanced Bridging of Vision-Language Models and Diffusion Models via Noisy Query Tokens
TempR1: Improving Temporal Understanding of MLLMs via Temporal-Aware Multi-Task Reinforcement Learning
PrivSynth: Alternating and Control-Based Optimization for Privacy and Utility in Synthetic Data
Spatio-Temporal Conditional Denoising Transformer for Modality-Missing RGBT Tracking
Progressive Multi-cue Alignment for Unaligned RGBT Tracking
MEMO: Human-like Crisp Edge Detection Using Masked Edge Prediction
Unposed-to-3D: Learning Simulation-Ready Vehicles from Real-World Images
Rethinking Knowledge Transfer in Image Quality Assessment: A Perceptual Preference Structure Alignment Perspective
ShelfOcc: Native 3D Supervision beyond LiDAR for Vision-Based Occupancy Estimation
CRIT: Graph-Based Automatic Data Synthesis to Enhance Cross-Modal Multi-Hop Reasoning
Real-Time Multimodal Fingertip Contact Detection via Depth and Motion Fusion for Vision-Based Human–Computer Interaction
Haptic Neural Fields: Bringing Tactile Interactions to 3D Rendered Scenes
ImmerIris: A Large-Scale Dataset and Benchmark for Off-Axis and Unconstrained Iris Recognition in Immersive Applications
PureProof: Diffusion-Resistant Black-box Targeted Attack on Large Vision-Language Models
Discover, Segment, and Select: A Progressive Mechanism for Zero-shot Camouflaged Object Segmentation
PiLoT: Neural Pixel-to-3D Registration for UAV-based Ego and Target Geo-localization
Efficient Training for Human Video Generation with Entropy-Guided Prioritized Progressive Learning
Easy2Hard: From Partially to Fully Unmatched Modalities as Negative Samples in Contrastive Learning
CycleBEV: Regularizing View Transformation Networks via View Cycle Consistency for Bird’s-Eye-View Semantic Segmentation
Multimodal Protein Language Models for Enzyme Kinetic Parameters: From Substrate Recognition to Conformational Adaptation
Tutor-Student Reinforcement Learning: A Dynamic Curriculum for Robust Deepfake Detection
Skyreels-Text: Fine-grained Font-Controllable Text Editing for Poster Design
HySeg: Learning Generative Priors for Structure-Aware Remote Sensing Segmentation
Bridging Domain Expertise and Generalization for Performance Estimation
Learning to Focus and Precise Cropping: A Reinforcement Learning Framework with Information Gaps and Grounding Loss for MLLMs
Goldilocks Test Sets for Face Verification
Easy3E: Feed-Forward 3D Asset Editing via Rectified Voxel Flow
MoCapAnything: Unified 3D Motion Capture for Arbitrary Skeletons from Monocular Videos
Cross-Modal Emotion Transfer for Emotion Editing in Talking Face Video
PolySLGen: Online Multimodal Speaking–Listening Reaction Generation in Polyadic Interaction
POINTS-Long: Adaptive Dual-Mode Visual Reasoning in MLLMs
Visual-RRT: Finding Paths toward Visual-Goals via Differentiable Rendering
Spectral Super-Resolution via Adversarial Unfolding and Data-Driven Spectrum Regularization: From Multispectral Satellite Data to NASA Hyperspectral Image
Breaking Multimodal LLM Safety via Video-Driven Prompting
ConceptPose: Training-Free Zero-Shot Object Pose Estimation using Concept Vectors
SPEAR-1: Scaling Beyond Robot Demonstrations via 3D Understanding
Love Me, Love My Label: Rethinking the Role of Labels in Prompt Retrieval for Visual In-Context Learning
C$^2$FG: Control Classifier-Free Guidance via Score Discrepancy Analysis
FluoCLIP: Stain-Aware Focus Quality Assessment in Fluorescence Microscopy
EvoComp: Learning Visual Token Compression for Multimodal Large Language Models via Semantic-Guided Evolutionary Labeling
Adversarial Style Optimization: Enhancing VLM Jailbreaks by GRPO-based Stylistic Triggers Optimization
RaUF: Learning the Spatial Uncertainty Field of Radar
One-to-More: High-Fidelity Training-Free Anomaly Generation with Attention Control
Instance-level Visual Active Tracking with Occlusion-Aware Planning
CG-Floor: Centroid-Guided Diffusion for Large-Scale Floorplan Generation
Hierarchical Enhancement of Semantic Priors for Disentangled Text-Driven Motion Generation
DemoFunGrasp: Universal Dexterous Functional Grasping via Demonstration-Editing Reinforcement Learning
VENI: Variational Encoder for Natural Illumination
Learning to See through Illumination Extremes with Event Streaming in Multimodal Large Language Models
SDDF: Specificity-Driven Dynamic Focusing for Open-Vocabulary Camouflaged Object Detection
PaNDaS: Learnable Shape Interpolation Modeling with Localized Control
Phrase-grounded APO for Improving Chest X-ray Report Generation
LIBERO-Plus: A Progressive Robustness Benchmark for Visual-Language-Action Models
NanoSD: Edge Efficient Foundation Model for Real Time Image Restoration
Teacher-Guided Routing for Sparse Vision Mixture-of-Experts
BEV-SLD: Self-Supervised Scene Landmark Detection for Global Localization with LiDAR Bird’s-Eye View Images
HP-Edit: A Human-Preference Post-Training Framework for Image Editing
DynamicTree: Interactive Real Tree Animation via Sparse Voxel Spectrum
Resolving the Stability-Plasticity Dilemma in Reinforcement Learning via Complementary Continual Critics
GraspGen-X: Cross-Embodiment 6-DOF Diffusion-based Grasping
One Model, Many Budgets: Elastic Latent Interfaces for Diffusion Transformers
Rare-E2E: Rare Events Dataset for End-to-End Driving in Challenging Long-tail Scenarios
MOSAIC-GS: Monocular Scene Reconstruction via Advanced Initialization for Complex Dynamic Environments
RunawayEvil: Jailbreaking the Image-to-Video Generative Models
GroundVTS: Visual Token Sampling in Multimodal Large Language Models for Video Temporal Grounding
Seeing Both Sides: Towards Bidirectional Semantic Alignment for Open-Vocabulary Camouflaged Object Segmentation
MatPedia: A Universal Generative Foundation for High-Fidelity Material Synthesis
QueryMe: Query-Driven Open-Vocabulary 3D Object Affordances Grounding from Multimodal Evidence
CHAL: Causal-guided Hierarchical Anomaly-aware Learning for Moving Infrared Small Target Detection
PET-DINO: Unifying Visual Cues into Grounding DINO with Prompt-Enriched Training
Beyond Ground-Truth: Leveraging Image Quality Priors for Real-World Image Restoration
Towards Cross-Modal Preservation, Consistency and Alignment for Privacy-Preserving Visible-Infrared Person Re-Identification
SIGMA: A Physics-Informed Benchmark for Gas Chimney Understanding in Seismic Images
EmoStyle: Emotion-Driven Image Stylization
MeshRipple: Structured Autoregressive Generation of Artist-Meshes
DiverseGRPO: Mitigating Mode Collapse in Image Generation via Diversity-Aware GRPO
MapRoute: Precise-Concept Erasing Mappers via Semantic Routing
VLM-Pruner: Buffering for Spatial Sparsity in an Efficient VLM Centrifugal Token Pruning Paradigm
Thinking with Frames: Generative Video Distortion Evaluation via Frame Reward Model
Active Inference for Micro-Gesture Recognition: EFE-Guided Temporal Sampling and Adaptive Learning
FailureAtlas: Mapping the Failure Landscape of T2I Models via Active Exploration
RefTON: Person-to-Person Virtual Try-On with Unpaired Visual References
VAE-REPA: Variational Autoencoder Representation Alignment for Efficient Diffusion Training
When Token Pruning is Worse than Random: Understanding Visual Token Information in VLLMs
COPO: Causal-Oriented Policy Optimization for Hallucinations of MLLMs
VideoChat-M1: Collaborative Policy Planning for Video Understanding via Multi-Agent Reinforcement Learning
DBMSolver: A Training-free Diffusion Bridge Sampler for High-Quality Image-to-Image Translation
LaMoGen: Language to Motion Generation Through LLM-Guided Symbolic Inference
BluRef: Unsupervised Image Deblurring with Dense-Matching References
PoseD-Flow: Versatile and Guided Flow Matching Model of Human Pose
Learning What to Trust: Bayesian Prior-Guided Optimization for Visual Generation
Towards Stealthy and Effective Backdoor Attacks on Lane Detection: A Naturalistic Data Poisoning Approach
A Semantically Disentangled Unified Model for Multi-category 3D Anomaly Detection
Breaking the Illusion: When Positive Meets Negative in Multimodal Decoding
WorldMM: Dynamic Multimodal Memory Agent for Long Video Reasoning
Your One-Stop Solution for AI-Generated Video Detection
Avatar Forcing: Real-Time Interactive Head Avatar Generation for Natural Conversation
Single-Round Scalable Analytic Federated Learning
FAPE-IR: Frequency-Aware Planning and Execution Framework for All-in-One Image Restoration
Spatial Retrieval Augmented Autonomous Driving
VideoRealBench: A Chain-of-Thought Realism Evaluation Benchmark for Generated Human-Centric Videos
Motus: A Unified Latent Action World Model
Seeing What Matters: Visual Preference Policy Optimization for Visual Generation
Unlocking Strong Supervision: A Data-Centric Study of General-Purpose Audio Pre-Training Methods
Image Guides Images: Consistent Video Amodal Completion with Rectified In-Context Exemplar Guidance
Simple-ViLMedSAM: Simple Text Prompts Meet Vision-Language Models for Medical Image Segmentation
Where, What, Why: Toward Explainable 3D-GS Watermarking
AT-VLA: Adaptive Tactile Injection for Enhanced Feedback Reaction in Vision-Language-Action Models
LLaVAShield: Safeguarding Multimodal Multi-Turn Dialogues in Vision-Language Models
UniGeoRS: A Unified Benchmark for Tri-view Geo-Localization
PlannerRFT: Reinforcing Diffusion Planners through Closed-Loop and Sample-Efficient Fine-Tuning
Fully Decentralized Certified Unlearning
SegGBC: Justifiable Coarse-to-Fine Granular-Ball Computing for Enhancing Clustering Image Segmentation
Stabilizing Feature Geometry in Noisy Pretrained Models for Robust Downstream Tasks
UCMNet: Uncertainty-Aware Context Memory Network for Under-Display Camera Image Restoration
Learning Multi-View Spatial Reasoning from Cross-View Relations
VIVA: VLM-Guided Instruction-Based Video Editing with Reward Optimization
Complementary Prototype Mapping for Efficient Multimodal Anomaly Detection
From Panel to Pixel: Zoom-In Vision–Language Pretraining from Biomedical Scientific Literature
Do Vision-Language Models Measure Up? Benchmarking Visual Measurement Reading with MeasureBench
Eliminate Distance Differences Induced by Backdoor Attacks: Layer-Selective Training and Clipping to Mask Backdoor Models
How Far Can We Go With Synthetic Data for Audio-Visual Sound Source Localization?
KLIP: Localized Distribution Shift Detection via KL-Divergence with Diffusion Priors in Inverse Problems
FSFSplatter: Geometrically Accurate Reconstruction with Free Sparse-view Images within 2 minutes
Fine-Grained Post-Training Quantization for Large Vision Language Models with Quantization-Aware Integrated Gradients
Fast-ThinkAct: Efficient Vision-Language-Action Reasoning via Verbalizable Latent Planning
Towards Robust Multi-Modal Semantic Segmentation with Teacher-Student Framework and Hybrid Prototype Distillation
4D-RGPT: Toward Region-level 4D Understanding via Perceptual Distillation
Iterative Closed-Loop Motion Synthesis for Scaling the Capabilities of Humanoid Control
Mesh-Pro: Asynchronous Advantage-guided Ranking Preference Optimization for Artist-style Quadrilateral Mesh Generation
UniSH: Unifying Scene and Human Reconstruction in a Feed-Forward Pass
MeshMosaic: Scaling Artist Mesh Generation via Local-to-Global Assembly
Watch and Learn: Learning to Use Computers from Online Videos
HumanNOVA: Photorealistic, Universal and Rapid 3D Human Avatar Modeling from a Single Image
Measure The Feature Universe: Topology-based Pseudo Labeling and Gravity Consistency for Source-Free Domain Adaptation
SAME: Sparse and Anchored Model Editing for Heterogeneous Incremental Learning under Limited Data
Revisiting 3D Reconstruction Kernels as Low-Pass Filters
MAD: Modality-Adaptive Decoding for Mitigating Cross-Modal Hallucinations in Multimodal Large Language Models
Harmonic Canvas: Inversion-Free Editing for Visually-Guided Music Style Transfer
PQDT: Pseudo-Query Dual Transformer for Robust Point Cloud Restoration
Rethinking Dataset Distillation: Hard Truths about Soft Labels
DeepfakeImpact: A Two-Stage Benchmark with Real-World Impact in Deepfake Detection
BiProLoRA: Bilevel Prompt LoRA for Real Scene Recovery
High-Fidelity Virtual Try-On beyond Paired Data Scarcity via Diffusion-based Cycle-Consistent Learning
Decoupled Generative Modeling for Human-Object Interaction Synthesis
SMV-EAR: Bring Spatiotemporal Multi-View Representation Learning into Efficient Event-Based Action Recognition
Hierarchical Attacks for Multi-Modal Multi-Agent Reasoning
WebGym: Scaling Training Environments for Long-Horizon Visual Web Agents with A Million Realistic Tasks
GroundingME: Exposing the Visual Grounding Gap in MLLMs through Multi-Dimensional Evaluation
Real-World Point Tracking with Verifier-Guided Pseudo-Labeling
Look Before You Fuse: 2D-Guided Cross-Modal Alignment for Robust 3D Detection
OneCAT: Decoder-Only Auto-Regressive Model for Unified Understanding and Generation
Zero-Shot Image Denoising via Hybrid Prior-Guided Pseudo Sample Generation
AcTTA: Rethinking Test-Time Adaptation via Dynamic Activation
UniLDiff: Unlocking the Power of Diffusion Priors for All-in-One Image Restoration
CodeDance: A Dynamic Tool-integrated MLLM for Executable Visual Reasoning
Real2Edit2Real: Generating Robotic Demonstrations via a 3D Control Interface
Physical Adversarial Examples through Camera Power Signal Injection
PSDesigner: Automated Graphic Design with a Human-Like Creative Workflow
Implicit-Knowledge Visual Question Answering with Structured Reasoning Traces
SketchRevive: Fine-Grained Pixel-to-Vector Sketch Completion with Diffusion-Prior–Guided Multimodal LLMs
Progress-Think: Semantic Progress Reasoning for Vision-Language Navigation
Scaling Agentic Reinforcement Learning for Tool-Integrated Reasoning in VLMs
Activation Matters: Test-time Activated Negative Labels for OOD Detection with Vision-Language Models
VA-$\pi$: Variational Policy Alignment for Pixel-Aware Autoregressive Generation
Phantom: Physical Object Interactions as Dynamic Triggers for NMS-Exploited Backdoors
Locate-then-Sparsify: Attribution Guided Sparse Strategy for Visual Hallucination Mitigation
Learning Generalizable 3D Medical Image Representations from Mask-Guided Self-Supervision
ANTS: Adaptive Negative Textual Space Shaping for OOD Detection via Test-Time MLLM Understanding and Reasoning
Logit-Margin Repulsion for Backdoor Defense
Jailbreaking Vision-Language Models via Dissonance-Guided Suffix Optimization and Image–Phrase Injection
The SA-FARI Dataset: Segment Anything in Footage of Animals for Recognition and Identification
Mario: Multimodal Graph Reasoning with Large Language Models
SyncMos: Scalable Motion Synchronisation for Multi-Agent Scene Interaction
Memory-Efficient Transfer Learning with Fading Side Networks via Masked Dual Path Distillation
DreamSAC: Learning Hamiltonian World Models via Symmetry Exploration
Nonparametric Deep Fine-grained Clustering with Low-Rank Guided Vision-Language Model
Adaptive Spectral Feature Forecasting for Diffusion Sampling Acceleration
GR-Gauge: Cost-efficient Training Configuration By Gauging the Gradient Redundancy
SparseVAR: Exploring Sparsity in Visual AutoRegressive Modeling for Training-Free Acceleration
Out of Sight, Out of Track: Adversarial Attacks on Propagation-based Multi-Object Trackers via Query State Manipulation
Align While Search: Belief-Guided Exploratory Inference for World-Grounded Embodied Agents
World in a Frame: Understanding Culture Mixing as a New Challenge for Vision-Language Models
Extend3D: Town-scale 3D Generation
Mixture of Style Experts for Diverse Image Stylization
LacTokGen: Latent Consistency Tokenizer for 1024-pixel Image Generation by 256 Tokens
FlashMotion: Few-Step Controllable Video Generation with Trajectory Guidance
Rethinking Two-Stage Referring-by-Tracking in Referring Multi-Object Tracking: Make it Strong Again
DiT-Distill: Open-Set Fine-Grained Retrieval via Generative Curriculum Knowledge
SemiGDA: Generative Dual-distribution Alignment for Semi-Supervised Medical Image Segmentation
Seeing Motion Through Polarity for Event-based Action Recognition
MorphAny3D: Unleashing the Power of Structured Latent in 3D Morphing
Muses: Designing, Composing, Generating Nonexistent Fantasy 3D Creatures without Training
DiP: Taming Diffusion Models in Pixel Space
VINS-120K: Ultra High-Resolution Image Editing with A Large-Scale Dataset
MedMO: Grounding and Understanding Multimodal Large Language Model for Medical Images
Representation-Steered Incremental Adapter-Tuning for Class-Incremental Learning with Pre-Trained Models
DPGF-Net: Dual-Prior Guided Fusion Network for Joint Assessment of Perceptual Quality and Semantic Consistency in AI-Generated Images
STAvatar: Soft Binding and Temporal Density Control for Monocular 3D Head Avatars Reconstruction
TokenHand: Discrete Token Representation for Efficient Hand Mesh Reconstruction
Thermally Activated Dual-Modal Adversarial Clothing against AI Surveillance Systems
FOZO: Forward-Only Zeroth-Order Prompt Optimization for Test-Time Adaptation
TACO: Task-Aware Contrastive Learning for Joint LiDAR Localization and 3D Object Detection
OpenT2M: No-frill Motion Generation with Open-source, Large-scale, High-quality Data
Prune2Drive: A Plug-and-Play Framework for Accelerating Vision-Language Models in Autonomous Driving
Which Concepts to Forget and How to Refuse? Decomposing Concepts for Continual Unlearning in Large Vision-Language Models
Variation-aware Vision Token Dropping for Faster Large Vision-Language Models
OmniDocLayout: Towards Diverse Document Layout Generation via Coarse-to-Fine LLM Learning
TRivia: Self-supervised Fine-tuning of Vision-Language Models for Table Recognition
DUET-VLM: Dual stage Unified Efficient Token reduction for VLM Training and Inference
VGGT-Segmentor: Geometry-Enhanced Cross-View Segmentation
Boosting Vision-Language-Action Finetuning with Feasible Action Neighborhood Prior
Benchmarking PhD-Level Coding in 3D Geometric Computer Vision
PAM: A Pose–Appearance–Motion Engine for Sim-to-Real HOI Video Generation
When Transformers Meet Mamba: A Hybrid Transformer-Mamba Network for Video Object Detection
LightRR: A Lightweight Network for Single Image Reflection Removal
Batman: Benign Knowledge Alignment Through Malicious Null Space in Federated Backdoor Attack
TimeRipples: Accelerating vDiTs by Understanding the Spatio-Temporal Correlations in Latent Space
FedSDR: Federated Graph Learning with Structural Noise Detection and Reconstruction
AgentDet: A Shared-Blackboard Multi-Agent Framework for Zero-/Few-Shot Object Detection
InvAD: Inversion-based Reconstruction-Free Anomaly Detection with Diffusion Models
LAOF: Robust Latent Action Learning with Optical Flow Constraints
Masking Matters: Unlocking the Spatial Reasoning Capabilities of LLMs for 3D Scene-Language Understanding
DynamicVGGT: Learning Dynamic Point Maps for 4D Scene Reconstruction in Autonomous Driving
Towards Photorealistic and Efficient Bokeh Rendering via Diffusion Framework
$x^2$-Fusion: Cross-Modality and Cross-Dimension Flow Estimation in Event Edge Space
ReAG: Reasoning-Augmented Generation for Knowledge-based Visual Question Answering
Detecting AI-Generated Forgeries via Iterative Manifold Deviation Amplification
HOPS: Hierarchical Open-vocabulary Part Segmentation with Attention-Aware Filtering and Affinity-Guided Enhancement
TreeTeaming: Autonomous Red-Teaming of Vision-Language Models via Hierarchical Strategy Exploration
Beyond Caption-Based Queries in Video Moment Retrieval
Rethinking Box Supervision: Bias-Free Weakly Supervised Medical Segmentation
Towards Reliable Evaluation of Adversarial Robustness for Spiking Neural Networks
Online3R: Online Learning for Consistent Sequential Reconstruction Based on Geometry Foundation Model
LoPrune: Efficient Data Pruning for LoRA-based Fine-Tuning of Vision Transformers
NuWa: Deriving Lightweight Class-Specific Vision Transformers for Edge Devices
Your Dissimilarities Define You: Complementary Learning Exploiting Class Diversities
OpenMarcie: Dataset for Multimodal Action Recognition in Industrial Environments
SPREAD: Spatial-Physical Reasoning via gEometry Aware Diffusion
Towards Visual Query Localization in the 3D World
Bridging the 2D-3D Gap: A Hierarchical Semantic-Geometric Map for Vision Language Navigation
CLiViS: Unleashing Cognitive Map through Linguistic-Visual Synergy for Embodied Visual Reasoning
HierAmp: Coarse-to-Fine Autoregressive Amplification for Generative Dataset Distillation
WorldLens: Full-Spectrum Evaluations of Driving World Models in Real World
V$^{2}$-SAM: Marrying SAM2 with Multi-Prompt Experts for Cross-View Object Correspondence
Think, Then Verify: A Hypothesis–Verification Multi-Agent Framework for Long Video Understanding
RecEdit-Drive: 3D Reconstruction-Guided Spatiotemporal Video Editing for Autonomous Driving Scenes
EgoSound: Benchmarking Sound Understanding in Egocentric Videos
Omni-Supervised Motion Editing: Balancing Change and Invariance through Positive-Negative Learning
BA-GS: Bayesian Adaptive Gaussian Splatting for SFM-Free 3D Reconstruction
TerraSeg: Self-Supervised LiDAR Foundation Model for Ground Segmentation
Remote Sensing Image Super-Resolution for Imbalanced Textures: A Texture-Aware Diffusion Framework
When Anonymity Breaks: Identifying Models Behind Text-to-Image Leaderboards
Electromagnetic Inverse Scattering from a Single Transmitter
CoV-Align: Efficient Fine-grained Cross-Modal Alignment with Cohesive Visual Semantics Priority
The Power of Prior: Training-Free Open-Vocabulary Semantic Segmentation with LLaVA
Beyond Success: Refining Elegant Robot Manipulation from Mixed-Quality Data via Just-in-Time Intervention
Domain-Aware Federated Learning via Fisher-Guided Pruning
Guardians of the Hair: Rescuing Soft Boundaries in Depth, Stereo, and Novel Views
SpatialScore: Towards Comprehensive Evaluation for Spatial Intelligence
PR-IQA: Partial-Reference Image Quality Assessment for Diffusion-Based Novel View Synthesis
CARI4D: Category Agnostic 4D Reconstruction of Human-Object Interaction
FrankenMotion: Part-level Human Motion Generation and Composition
Foundation Encoders are All You Need for Personalized Image Generation
ThinkingViT: Matryoshka Thinking Vision Transformer for Elastic Inference
Your Classifier Can Do More: Towards Bridging the Gaps in Classification, Robustness, and Generation
Degradation-Consistent Test-Time Adaptation for All-in-One Image Restoration
Relightable Holoported Characters: Capturing and Relighting Dynamic Human Performance from Sparse Views
Probabilistic Discrepancy Learning for Roadside LiDAR Scene Completion
Rationale-Enhanced Decoding for Multi-modal Chain-of-Thought
Looking Beyond the Window: Global-Local Aligned CLIP for Training-free Open-Vocabulary Semantic Segmentation
Stronger Normalization-Free Transformers
PhysHead: Simulation-Ready Gaussian Head Avatars
PyramidalWan: On Making Pretrained Video Model Pyramidal for Efficient Inference
WiTTA-Bench: Benchmarking Test-Time Adaptation for WiFi Sensing
OVSegDT: Segmenting Transformer for Open-Vocabulary Object Goal Navigation
SPARROW: Learning Spatial Precision and Temporal Referential Consistency in Pixel-Grounded Video MLLMs
HTNav: A Hybrid Navigation Framework with Tiered Structure for Urban Aerial Vision-and-Language Navigation
How to Take a Memorable Picture? Empowering Users with Actionable Feedback
Markovian Scale Prediction: A New Era of Visual Autoregressive Generation
Evo-1: Lightweight Vision-Language-Action Model with Preserved Semantic Alignment
HERO: Hierarchical Embedding-Refinement for Open-Vocabulary Temporal Sentence Grounding in Videos
TROPHIES: Temporal Reconstruction of Places, Humans, and Cameras from Multi-view Videos
CoVFT: Context-aware Visual Fine-tuning for Multimodal Large Language Models
FlashMesh: Faster and Better Autoregressive Mesh Synthesis via Structured Speculation
SoPE: Spherical Coordinate-Based Positional Embedding for Enhancing Spatial Perception of 3D LVLMs
PriVi: Towards a General-Purpose Video Model for Primate Behavior in the Wild
TALO: Pushing 3D Vision Foundation Models Towards Globally Consistent Online Reconstruction
Explore with Long-term Memory: A Benchmark and Multimodal LLM-based Reinforcement Learning Framework for Embodied Exploration
VKG-QA: Visual Knowledge Graph-based Question Answer for Large Multimodal Models
DocSeeker: Structured Visual Reasoning with Evidence Grounding for Long Document Understanding
Goal-Driven Reward by Video Diffusion Models for Reinforcement Learning
GeoMMBench and GeoMMAgent: Toward Expert-Level Multimodal Intelligence in Geoscience and Remote Sensing
Humanoid Generative Pre-Training for Zero-Shot Motion Tracker
MM-OVSeg: Multimodal Optical–SAR Fusion for Open-Vocabulary Segmentation in Remote Sensing
DarkAct: A RGB-Thermal Dataset and Fusion Framework for Multimodal Low-Light Action Recognition
Parameter-Efficient Adaptation for MLLMs via Implicit Modality Decomposition
Making the Classification Explanation Faithful to the Confidence Score
PointAlign: Feature-Level Alignment Regularization for 3D Vision-Language Models
SmokeSVD: Smoke Reconstruction from A Single View via Progressive Novel View Synthesis and Refinement with Diffusion Models
HiFICL: High-Fidelity In-Context Learning for Multimodal Tasks
DarkShake-DVS: Event-based Human Action Recognition under Low-light and Shaking Camera Conditions
EmoThinker: Advancing Visual-Acoustic Emotion Analysis via Structural Token Selection and Chain-of-Thought Reasoning
See, Think, Act: Teaching Multimodal Agents to Effectively Interact with GUI by Identifying Toggles
SEASON: Mitigating Temporal Hallucination in Video Large Language Models via Self-Diagnostic Contrastive Decoding
Diagram2Structure: Unlocking LLMs' Diagram Comprehension through DiagramDiff, an Offline Diagram Structuring Framework
SCE-Depth: A Spherical Compound Eye Framework for Wide FOV Depth Estimation
FlowHijack: A Dynamics-Aware Backdoor Attack on Flow-Matching Vision-Language-Action Models
Inconsistency-aware Multimodal Schrödinger Bridge for Deepfake Localization
Editprint: General Digital Image Forensics via Editing Fingerprint with Self-Augmentation Training
BiGMINT: Biologically-guided Hierarchical Multimodal Integration for Modeling Multiple Compound Activities in Drug Discovery
IP-Adapter Is All You Need: Towards Fine-Tuning-Free Diffusion-Based Talking Face Generation
Detect Any AI-Counterfeited Text Image
Forensic-Friendly Image Manipulation via Controllable Latent Diffusion
BiOTPrompt: Bidirectional Optimal Transport Guided Prompting for Disease Evolution-aware Radiology Report Generation
AIMDepth: Asymmetric Image-Event Mamba for Monocular Depth Estimation
CHIRP dataset: towards long-term, individual-level, behavioural monitoring of bird populations in the wild
UniComp: Rethinking Video Compression Through Informational Uniqueness
Imagine Before Concentration: Diffusion-Guided Registers Enhance Partially Relevant Video Retrieval
MotionEnhancer: Leveraging Video Diffusion for Motion-Enhanced Vision-Language Models
MER-Tracker: Towards High-Speed 3D Point Tracking via Multi-View Event-RGB Hybrid Cameras
Dual-branch Distilled Transformer for Efficient Asymmetric UAV Tracking
Plan, Imagine, then Act: Steering Your VLA with Efficient Visually Grounded Planning
Dynamic Black-hole Emission Tomography with Physics-informed Neural Fields
Red-teaming Retrieval-Augmented Diffusion Models via Poisoning Knowledge Bases
Multimodal Continual Instruction Tuning with Dynamic Gradient Guidance
DeltaQuant: 4-bit Video Diffusion Models with Spatiotemporal Delta Smoothing
Breaking Smooth-Motion Assumptions: A UAV Benchmark for Multi-Object Tracking in Complex and Adverse Conditions
OS-Fed: One Snapshot Is All You Need
MOFA-VTON: More Fashion Possibilities with Fine-Grained Adaptations in Virtual Try-On
Adaptive Depth Lightweight RGB-T Tracking with Holistic Token Routing
OddGridBench: Exposing the Lack of Fine-Grained Visual Discrepancy Sensitivity in Multimodal Large Language Models
Asking like Socrates: Socrates helps VLMs understand remote sensing images
Learning Forgery-Aware Lip Representations Without Forgery Priors
DP-FedAdamW: An Efficient Optimizer for Differentially Private Federated Large Models
DARC: Dual Adjustment Reasoning with Counterfactuals for Trustworthy Chest X-ray Classification
SHARP: Short-Window Streaming for Accurate and Robust Prediction in Motion Forecasting
CI-VID: A Coherent Interleaved Text-Video Dataset
Toward Early Quality Assessment of Text-to-Image Diffusion Models
VidEoMT: Your ViT is Secretly Also a Video Segmentation Model
Think with 3D: Geometric Imagination Grounded Spatial Reasoning from Limited Views
SLARM: Streaming and Language-Aligned Reconstruction Model for Dynamic Scenes
UniGeoSeg: Towards Unified Open-World Segmentation for Geospatial Scenes
OS-Oracle: A Comprehensive Framework for Cross-Platform GUI Critic Models
HATS: Hardness-Aware Trajectory Synthesis for GUI Agents
MMBench-GUI: A Unified Hierarchical Evaluation Framework for Multi-Platform GUI Agents
NoOVD: Novel Category Discovery and Embedding for Open-Vocabulary Object Detection
Understanding, Accelerating, and Improving MeanFlow Training
SPE-MVS: Spatial Position Encoding Enhanced Multi-View Stereo with Monocular Depth Priors
Rethinking Token Reduction for Large Vision-Language Models
CLIPoint3D: Language-Grounded Few-Shot Unsupervised 3D Point Cloud Domain Adaptation
DK-DDIL: Adaptive Knowledge Retention for Dynamic Domain-Incremental Learning in Medical Imaging
POCA: Pareto-Optimal Curriculum Alignment for Visual Text Generation
Learning Latent Proxies for Controllable Single-Image Relighting
Token Warping Helps MLLMs Look from Nearby Viewpoints
HTTM: Head-wise Temporal Token Merging for Faster VGGT
OrthoFuse: Training-free Riemannian Fusion of Orthogonal Style-Concept Adapters for Diffusion Models
IFCSR: Inference-Free Fidelity-Realism Control for One-Step Diffusion-based Real-World Image Super-Resolution
Revisiting the Necessity of Full Accuracy: Weakly Supervised Object-Level Offset Correction for Misaligned Building Labels
Variational Graph-based Normal Integration
AtomicVLA: Unlocking the Potential of Atomic Skill Learning in Robots
Dual-level Adaptation for Multi-Object Tracking: Building Test-Time Calibration from Experience and Intuition
MoRGS: Efficient Per-Gaussian Motion Reasoning for Streamable Dynamic 3D Scenes
Bridging Domains through Subspace-Aware Model Merging
UnReflectAnything: RGB-Only Highlight Removal by Rendering Synthetic Specular Supervision
TopoMA: Topology-Guided Multi-Agent Dense RGB 3D Reconstruction via Distributed Inference
Make it SING: Analyzing Semantic Invariants in Classifiers
ParTY: Part-Guidance for Expressive Text-to-Motion Synthesis
Franca: Nested Matryoshka Clustering for Scalable Visual Representation Learning
Structural Graph Probing of Vision–Language Models
Unleashing Vision-Language Semantics for Video Deepfake Detection
Graph-Guided Online Concept Erasure for Text-to-Image Diffusion Models
ChArtist: Generating Pictorial Charts with Unified Spatial and Subject Control
A Difference-in-Difference Approach to Detecting AI-Generated Images
AutoTraces: Autoregressive Trajectory Forecasting via Multimodal Large Language Models
NI-Tex: Non-isometric Image-based Garment Texture Generation
ReWeaver: Towards Simulation-Ready and Topology-Accurate Garment Reconstruction
PhaseWin Search Framework Enables Efficient Object-Level Interpretation
Decomposed Spatio-Temporal Alignment for Exocentric Video and Ambient Sensors with Staged Learning
RHO: Robust Holistic OSM-Based Metric Cross-View Geo-Localization
S2D: Sparse to Dense Lifting for 3D Reconstruction with Minimal Inputs
VIRD: View-Invariant Representation through Dual-Axis Transformation for Cross-View Pose Estimation
TeFlow: Enabling Multi-frame Supervision for Self-Supervised Feed-forward Scene Flow Estimation
BiEvLight: Bi-level Learning of Task-Aware Event Refinement for Low-Light Image Enhancement
Cupid: Generative 3D Reconstruction via Joint Object and Pose Modeling
Focus-to-Perceive Representation Learning: A Cognition-Inspired Hierarchical Framework for Endoscopic Video Analysis
IF-Bench: Benchmarking and Enhancing MLLMs for Infrared Images with Generative Visual Prompting
FlexAvatar: Flexible Large Reconstruction Model for Animatable Gaussian Head Avatars with Detailed Deformation
PerformRecast: Expression and Head Pose Disentanglement for Portrait Video Editing
DREAM: Document Recognition with Explicit Adaptive Memory
Enabling Supervised Learning of Generative Signatures for Generalized Synthetic Image Detection
Goal Force: Teaching Video Models To Accomplish Physics-Conditioned Goals
SparseOIT: Improving Order-Independent Transparency 3DGS via Active Set Method
PRIMU: Uncertainty Estimation for Novel Views in Gaussian Splatting from Primitive-Based Representations of Error and Coverage
PointCSP: Cross-Sample Semantic Propagation and Stability Preservation in Self-Supervised Point Cloud Learning
Prefill-Time Intervention for Mitigating Hallucination in Large Vision-Language Models
Nano-EmoX: Unifying Multimodal Emotional Intelligence from Perception to Empathy
Diffusion Forcing Planner: History-Annealed Planning with Time-Dependent Guidance for Autonomous Driving
DFM-Drive: Parallel Coarse-to-Fine Motion Planning via Discrete Flow Matching for Autonomous Driving
HAMMER: Harnessing MLLMs via Cross-Modal Integration for Intention-Driven 3D Affordance Grounding
EagleVision: A Dual-Stage Framework with BEV-grounding-based Chain-of-Thought for Spatial Intelligence
SegQuant: A Semantics-Aware and Generalizable Quantization Framework for Diffusion Models
Unstitching the Chimera: Frame-Level Risk and Train-Free Mitigation for Video Hallucination
WaTeRFlow: Watermark Temporal Robustness via Flow Consistency
Graph-to-Frame RAG: Visual-Space Knowledge Fusion for Training-Free and Auditable Video Reasoning
Reinforce to Learn, Elect to Reason: A Dual Paradigm for Video Reasoning
Catch Me if You Can: Active Mapping of Moving 3D Objects
BD-Merging: Bias-Aware Dynamic Model Merging with Evidence-Guided Contrastive Learning
ReAlign: Generalizable Image Forgery Detection via Reasoning-Aligned Representation
Ego: Embedding-Guided Personalization of Vision-Language Models
BEA-GS: BEyond RAdiance Supervision in 3DGS for Precise Object Extraction
Multi-Scale Speculative Decoding
NaTex: Seamless Texture Generation as Latent Color Diffusion
X-Part: High Fidelity And Structure Coherent Shape Decomposition And Completion
Content-Adaptive Hierarchical Hyperprior for Neural Video Coding
Immunizing Models Against Harmful Long-Horizon Fine-Tuning via Contractive Optimization Dynamics
SemVideo: Reconstructs What You Watch from Brain Activity via Hierarchical Semantic Guidance
EditMGT: Unleashing Potentials of Masked Generative Transformers in Image Editing
Skullptor: High Fidelity 3D Head Reconstruction in Seconds with Multi-View Normal Prediction
Text-Conditioned Contrastive Learning for Multi-Granular Vision-Language Alignment in CLIP
Beyond Reassembly: Fractured Object Recovery with Missing Parts
Breaking the Scalability Limit of Multi-Projector Calibration with Embedded Cameras
Text-Driven 3D Hand Motion Generation from Sign Language Data
Mostly Text, Smart Visuals: Asymmetric Text-Visual Pruning for Large Vision-Language Models
Audio-sync Video Instance Editing with Granularity-Aware Mask Refiner
FlowFixer: Towards Detail-Preserving Subject-Driven Generation
LiDAS: Lighting-driven Dynamic Active Sensing for Nighttime Perception
TriLite: Efficient Weakly Supervised Object Localization with Universal Visual Features and Tri-Region Disentanglement
GraPHFormer: a multimodal graph persistent homology transformer for the analysis of neuroscience morphologies
AlignPose: Generalizable 6D Pose Estimation via Multi-view Feature-metric Alignment
Beyond Geometry: Artistic Disparity Synthesis for Immersive 2D-to-3D
A Bit is All You Need! Efficient Video Capture via Single Bit Imaging
CoIn: Coverage and Informativeness-Guided Token Reduction for Efficient Large Multimodal Models
Bridging the Modality Gap in Compositional Zero-Shot Learning via Sparse Alignment and Unimodal Memory Bank
TALON: Test-time Adaptive Learning for On-the-Fly Category Discovery
M⁴-SAM: Multi-modal Mixture-of-Experts with Memory-Augmented SAM for RGB-D Video Salient Object Detection
Real-Time Long Horizon Air Quality Forecasting via Group-Relative Policy Optimization
Widget2Code: From Visual Widgets to UI Code via Multimodal LLMs
Fine-VAD: Towards Fine-Grained Video Anomaly Detection via Progressive Cross-Granularity Learning
RGB-Event based Pedestrian Attribute Recognition: A Benchmark Dataset and An Asymmetric RWKV Fusion Framework
QD-PCQA: Quality-Aware Domain Adaptation for Point Cloud Quality Assessment
CARE-Edit: Condition-Aware Routing of Experts for Contextual Image Editing
What Are You Doing? A Closer Look at Controllable Human Video Generation
DepthFocus: Controllable Depth Estimation for See-Through Scenes
EthoCLIP: Ontology-Enhanced Video-Language Pretraining for Animal Behavior Understanding
Adaptive Data Augmentation with Multi-armed Bandit: Sample-Efficient Embedding Calibration for Implicit Pattern Recognition
Thinking With Videos: Multimodal Tool-Augmented Reinforcement Learning for Long Video Reasoning
EW-DETR: Evolving World Object Detection via Incremental Low-Rank DEtection TRansformer
TESSERA: Temporal Embeddings of Surface Spectra for Earth Representation and Analysis
Momentum Memory for Knowledge Distillation in Computational Pathology
Progressive Neural Architecture Generation
OctoNav: Towards Generalist Embodied Navigation
HiF-VLA: Hindsight, Insight and Foresight through Motion Representation for Vision-Language-Action models
SPEGC: Continual Test-Time Adaptation via Semantic-Prompt-Enhanced Graph Clustering for Medical Image Segmentation
Spectrum from Defocus: Fast Spectral Imaging with Chromatic Focal Stack
PETAR: Localized Findings Generation with Mask-Aware Vision-Language Modeling for PET Automated Reporting
LOREAL: Mitigating Low-Resolution Challenges in Vision-Language Models with Attribute-driven Prompt Self-Distillation
One Layer’s Trash is Another Layer’s Treasure: Adaptive Layer-wise Visual Token Selection in LVLMs
CCCaption: Dual-Reward Reinforcement Learning for Complete and Correct Image Captioning
Beyond Single Solution: Multi-Hypothesis Deep Unfolding Network for Image Compressive Sensing
Language Models Can Explain Visual Features via Steering
DTG-Restore: Training-Free Diffusion Refinement for Generative Video Super-Resolution
Gamba: Mamba-based graph convolutional network with dynamic graph topology learning for action recognition
Draft and Refine with Visual Experts
Learning from Semantic Dictionaries: Discriminative Codebook Contrastive Learning for Unified Visual Representation and Generation
Meta-Learning In-Context Enables Training-Free Cross Subject Brain Decoding
SAVA-X: Ego-to-Exo Imitation Error Detection via Scene-Adaptive View Alignment and Bidirectional Cross View Fusion
Fingerprinting Diffusion models in the wild
FedMPT: Federated Multi-Label Prompt Tuning of Vision-Language Models
Efficient Weighted Sampling via Score-based Generative Models
EVA: Efficient Reinforcement Learning for End-to-End Video Agent
ELV-Halluc: Benchmarking Semantic Aggregation Hallucinations in Video Understanding
LEMON: A Large Endoscopic MONocular Dataset and Foundation Model for Perception in Surgical Settings
NTK-Guided Implicit Neural Teaching
Extending One-Step Image Generation from Class Labels to Text via Discriminative Text Representation
SAMTok: Representing Any Mask with Two Words
BiPA: Bilevel Prompt Adaptation for Underwater Instance Segmentation
Bilevel Layer-Positioning LoRA for Real Image Dehazing
Diffusion MRI Transformer with a Diffusion Space Rotary Positional Embedding (D-RoPE)
Bridging Human Evaluation to Infrared and Visible Image Fusion
Benchmarking Endoscopic Surgical Image Restoration and Beyond
Streaming Diffusion Model for Fast Infrared and Visible Video Fusion
DiffGraph: An Automated Agent-driven Model Merging Framework for In-the-Wild Text-to-Image Generation
Dynamic Magic: Unleashing Restricted Knowledge for Lifelong Person Re-Identification
RAGTrack: Language-aware RGBT Tracking with Retrieval-Augmented Generation
RiskProp: Collision-Anchored Self-supervised Temporal Constraints for Early Accident Anticipation
HAWK: Head Importance-Aware Visual Token Pruning in Multimodal Models
Denoise and Align: Towards Source-Free UDA for Robust Panoramic Semantic Segmentation
ReBaPL: Repulsive Bayesian Prompt Learning
Diagnose, Correct, and Learn from Manipulation Failures via Visual Symbols
Cov2Pose: Leveraging Spatial Covariance for Direct Manifold-aware 6-DoF Object Pose Estimation
META: Meta Evolution of Tool Trajectory Adaptation for Long-Video Understanding
STARFlow-V: End-to-End Video Generative Modeling with Autoregressive Normalizing Flows
Stochastic Ray Tracing for the Reconstruction of 3D Gaussian Splatting
AVION: Aerial Vision–Language Instruction from Offline Teacher to Prompt-Tuned Network
Helios: Stable Latent Image Modeling for Multimodal Earth Observation
YieldSAT: A Multimodal Benchmark Dataset for High-Resolution Crop Yield Prediction
Unique Lives, Shared World: Learning from Single-Life Videos
Small Object, Great Challenge: A Benchmark for Small Object Visual Grounding
pH-Strips for Selective Forgetting: A Blunt but Fast Diagnostic Baseline for Machine Unlearning
Deeper Thought, Weaker Aim: Understanding and Mitigating Perceptual Impairment during Reasoning in Multimodal Large Language Models
Large-scale Codec Avatars: The Unreasonable Effectiveness of Large-scale Avatar Pretraining
Kaleidoscopic Scintillation Event Imaging
VisionLeaf: Entropy-Guided Leaf-First Reasoning for Efficient and Accurate Think-with-Image
VisionDirector: Vision-Language Guided Closed-Loop Refinement for Generative Image Synthesis
MeshSplatting: Differentiable Rendering with Opaque Meshes
VL-Eraser: Vacuum Distillation for Machine Unlearning in Vision-Language Models
WPT: World-to-Policy Transfer via Online World Model Distillation
GenTract: Generative Global Tractography
SemLayer: Semantic Generative Segmentation and Layer Reconstruction for Vector Icons
SRGCD: Stability-Driven Region Growth Framework for 3D Change Detection
CVA: Context-aware Video-text Alignment for Video Temporal Grounding
MHopReg: Efficient Hierarchical Multi-Hop Graph Search for Point Cloud Registration
RelightAnyone: A Generalized Relightable 3D Gaussian Head Model
Meta-FC: Meta-Learning with Feature Consistency for Robust and Generalizable Watermarking
Dual-Level Hypergraph Generation for Addressing Feature Scarcity in Whole-Slide Image Classification
Scaling Dense Event-Stream Pretraining from Visual Foundation Models
MiniCPM-V 4.5: Cooking Efficient MLLMs via Architecture, Data, and Training Recipe
Edit-aware RAW reconstruction
Perceptual 3D Simulation With Physical World Modeling
Beyond Fixed Formulas: Data-Driven Linear Predictor for Efficient Diffusion Models
Physical Object Understanding with a Physically Controllable World Model
SHOW3D: Capturing Scenes of 3D Hands and Objects in the Wild
NERFIFY: Multi Agent Framework for Turning NeRF Papers into code
MMTIT-Bench: A Multilingual and Multi-Scenario Benchmark with Cognition–Perception–Reasoning Guided Text-Image Machine Translation
PromptLoop: Plug-and-Play Prompt Refinement via Latent Feedback for Diffusion Model Alignment
MERIT: Multi-domain Efficient RAW Image Translation
CGL: Advancing Continual GUI Learning via Reinforcement Fine-Tuning
Best Segmentation Buddies for Image-Shape Correspondence
Event6D: Event-based Novel Object 6D Pose Tracking
VideoWorld 2: Learning Transferable Knowledge from Real-world Videos
Can a Second-View Image Be a Language? Geometric and Semantic Cross-Modal Reasoning for X-ray Prohibited Item Detection
M4V: Multimodal Mamba for Efficient Text-to-Video Generation
StereoWorld: Geometry-Aware Monocular-to-Stereo Video Generation
Physics-Based Decomposition of Reflectance and Shading using a Single Visible-Thermal Image Pair
Circular-DPO: Aligning Multi-Stage 3D Generative Models via Preference Feedback Loop
ReasonEdit: Towards Reasoning-Enhanced Image Editing Models
MGDHand: Multi-Granularity Prior-to-Inertial Distillation Framework for Sequential 3D Hand Pose Estimation from Sparse IMUs
GRPO-Guard: Mitigating Implicit Over-Optimization in Flow Matching via Regulated Clipping
JarvisEvo: Towards a Self-Evolving Photo Editing Agent with Synergistic Editor-Evaluator Optimization
OMG-Bench: A New Challenging Benchmark for Skeleton-based Online Micro Hand Gesture Recognition
Garments2Look: A Multi-Reference Dataset for High-Fidelity Outfit-Level Virtual Try-On with Clothing and Accessories
Better than Average: Spatially-Aware Aggregation of Segmentation Uncertainty Improves Downstream Performance
Learning Anchor in Dual Orthogonal Space for Fast Multi-view Clustering
Masked-Diffusion Autoencoders for 3D Medical Vision Representation Learning
Open-Med-Reasoner: Data Recipes for Multimodal Medical Reasoning
DialogueVPR: Towards Conversational Visual Place Recognition
ProPhy: Progressive Physical Alignment for Dynamic World Simulation
RHCNet: Residual-Guided Hierarchical Calibration Network for Robust Underwater Object Detection
S2C2Seg: Semantic-Spatial Consistency and Category Optimization for Open-Vocabulary Segmentation
Incentivizing Versatile Video Reasoning in MLLMs via Data-Efficient Reinforcement Learning
InterAgent: Physics-based Multi-agent Command Execution via Diffusion on Interaction Graphs
From Manuals to Actions: A Unified VLA Model for Chain-of-Thought Manual Generation and Robotic Manipulation
Multi-Modal Representation Learning via Semi-Supervised Rate Reduction for Generalized Category Discovery
M3DocDep: Multi-modal, Multi-page, Multi-document Dependency Chunking with Large Vision-Language Models
Memory-Augmented Scene Understanding and Exploration for Open-World Aerial Object-Goal Navigation
When Robots Should Say "I Don’t Know": Benchmarking Abstention in Embodied Question Answering
Rel-Zero: Harnessing Patch-Pair Invariance for Robust Zero-Watermarking Against AI Editing
ChartNet: A Million-Scale, High-Quality Multimodal Dataset for Robust Chart Understanding
Improving Motion in Image-to-Video Models via Adaptive Low-Pass Guidance
SemLT3D: Semantic-Guided Expert Distillation for Camera-only Long-Tailed 3D Object Detection
ExpoCM: Exposure-Aware One-Step Generative Single-Image HDR Reconstruction
UniGenDet: A Unified Generative-Discriminative Framework for Co-Evolutionary Image Generation and Generated Image Detection
AVA-VLA: Improving Vision-Language-Action models with Active Visual Attention
ParaUni: Enhance Generation in Unified Multimodal Model with Reinforcement-driven Hierarchical Parallel Information Interaction
OnlineHMR: Video-based Online World-Grounded Human Mesh Recovery
Reasoning Palette: Modulating Reasoning via Latent Contextualization for Controllable Exploration for (V)LMs
Revisiting Unknowns: Towards Effective and Efficient Open-Set Active Learning
Video2Robo: 3DGS-based Synthetic Data from One Video Enables Scalable Robot Learning
CogniEdit: Dense Gradient Flow Optimization for Fine-Grained Image Editing
Prototype-based Causal Intervention for Multi-Label Image Classification
MacTok: Robust Continuous Tokenization for Image Generation
Lyapunov Probes for Hallucination Detection in Large Foundation Models
Will Multimodal Models Be Dazzled by Multi-Image Visual Puzzles?
TherA: Thermal-Aware Visual-Language Prompting for Controllable RGB-to-Thermal Infrared Translation
SpatialReward: Verifiable Spatial Reward Modeling for Fine-Grained Spatial Consistency in Text-to-Image Generation
Towards Balanced Multi-Modal Learning in 3D Human Pose Estimation
Illumination-Consistent Human-Scene Reconstruction from Monocular Video
IR-HGP: Physically-Aware Gaussian Inverse Rendering for High-Illumination Scenes via Generative Priors
GIFT: Global Irreplaceability Frame Targeting for Efficient Video Understanding
Attribution as Retrieval: Model-Agnostic AI-Generated Image Attribution
SDUIE: Semi-Supervised Diffusion for Underwater Image Enhancement with Quant-Text Dual Control
Towards Generalized Multimodal Homography Estimation
FoleyDirector: Directing Temporal Controllable Video-to-Audio Generation via Fine-Grained Temporal Scripts
Mitigating Error Amplification in Fast Adversarial Training
MorphSeek: Fine-grained Latent Representation-Level Policy Optimization for Deformable Image Registration
SpeeDiff: Scalable Pixel-Anchored End-to-End Latent Diffusion Model
ENC-Bench: A Benchmark for Evaluating Multimodal Large Language Models in Electronic Navigational Chart Understanding
GH-NAF: Grid-Adaptive Hash-Level–Attended Neural Attenuation Fields for Discrepancy-Aware CBCT
MetroGS: Efficient and Stable Reconstruction of Geometrically Accurate High-Fidelity Large-Scale Scenes
Vector Prism: Animating Vector Graphics by Stratifying Semantic Structure
Selectively Extracting and Injecting Visual Attributes into Text-to-Image Models
Dynamic Token Reweighting for Robust Vision-Language Models
Low-Rank Test-Time Training for Pre-Trained Point Cloud Models
LangField4D: Learning Identity-Adaptive and Spatio-Temporal Continuous 4D Language Fields for Dynamic Scenes
Z-Order Transformer for Feed-Forward Gaussian Splatting
ParticleGS: Learning Neural Gaussian Particle Dynamics from Videos for Prior-free Physical Motion Extrapolation
ParallelVLM: Lossless Video-LLM Acceleration with Visual Alignment Aware Parallel Speculative Decoding
More Natural, More Real: Object-aware Gaussian Splatting for 3D Visual Decoding from Human Brain
Progressive Supernet Training for Efficient Visual Autoregressive Modeling
Mantis: A Versatile Vision-Language-Action Model with Disentangled Visual Foresight
FedRE: A Representation Entanglement Framework for Model-Heterogeneous Federated Learning
TANGO: Text-Anchored Guided Optimization for Robust Fine-tuning Vision-Language Models under Label Noise
DyFCLT: Dynamic Frequency-Decoupled Cross-Modal Learning Transformer for Multimodal Tiny Object Detection
RecTok: Reconstruction Distillation along Rectified Flow
VisPlay: Self-Evolving Vision-Language Models
PIX-TAB: Efficient PIXel-Precise TABle Structure Recognition Approach with Speculative Decoding and Region-Based Image Segmentation
Benchmarking Single-Factor Physical Video-to-Audio Generation
AVA-Bench: Atomic Visual Ability Benchmark for Vision Foundation Models
GenieDrive: Towards Physics-Aware Driving World Model with 4D Occupancy Guided Video Generation
Lens Component Deletion based on Differentiable Ray Tracing
Teaching DINOv3 About Partial 3D Geometry: A Self-Supervised Geometry-Aware Approach
RINO: Rotation-Invariant Non-Rigid Correspondences
Multi-modal Frequency Decomposition Network for Semantic Scene Completion
SIMPACT: Simulation-Enabled Action Planning using Vision-Language Models
ThinkGen: Generalized Thinking for Visual Generation
When Safety Collides: Resolving Multi-Category Harmful Conflicts in Text-to-Image Diffusion via Adaptive Safety Guidance
DiGraphHal-Bench: Evaluating Multimodal Large Language Models on Complex Directed Graphs
Ego-1K – A Large-Scale Multiview Video Dataset for Egocentric Vision
SARMAE: Masked Autoencoder for SAR Representation Learning
A Faster Path to Continual Learning
BrickNet: Graph-Backed Generative Brick Assembly
MORE-STEM: Long-Short MemOry REcall and Spatio-TEmporal Consistency Model for Query-Driven 3D/4D Point Cloud Segmentation
Generating Humanless Environment Walkthroughs from Egocentric Walking Tour Videos
PANDA: Pretraining for vision ANd language with Dense Alignment
Beyond the Static-World: Lifelong Learning for All-in-One Medical Image Restoration
Retrieving Counterfactuals Improves Visual In-Context Learning
Linguistic Priors for Visual Decoupling: Towards Symmetric Vision-Brain Alignment
EchoFoley: Event-Centric Hierarchical Control for Video Grounded Creative Sound Generation
Tri-Subspaces Disentanglement for Multimodal Sentiment Analysis
Learnability-Driven Submodular Optimization for Active Roadside BEV Perception
Omni2Sound: A Fundamental Study on Dataset, Base Model, and Benchmark for Unified Video-Text-to-Audio Generation
MMDIR: Multimodal Instruction-Driven Framework for Mixed-Degradation Document Image Restoration
gQIR: Generative Quanta Image Reconstruction
Group Editing: Edit Multiple Images in One Go
HVG-3D: Bridging Real and Simulation Domains for 3D-Conditional Hand-Object Interaction Video Synthesis
Clone Deterministic 3D Worlds
Tokenizing Vector Animation for Autoregressive Generation
Building a Precise Video Language with Human–AI Oversight
Omni-Fake: Benchmarking Unified Multimodal Social Media Deepfake Detection
COG: Confidence-aware Optimal Geometric Correspondence for Unsupervised Single-reference Novel Object Pose Estimation
RetFormer: Multimodal Retrieval for Enhancing Image Recognition
3D-Object Perception Transformer (3PT)
Glove2Hand: Synthesizing Natural Hand-Object Interaction from Multi-Modal Sensing Gloves
MANSION: Multi-floor lANguage-to-3D Scene generatIOn for loNg-horizon tasks
Towards Uncertainty-aware Unsupervised Domain Adaptation for Videos and Time-Series with Causal Optimal Transport
Scaling-Aware Data Selection for End-to-End Autonomous Driving Systems
DSCA: Dynamic Subspace Concept Alignment for Lifelong VLM Editing
DROID-SLAM in the Wild
Bridging Facial Understanding and Animation via Language Models
Predict Before You Explore: Predictive Planning with Specialized Memory for Embodied Question Answering
DASH: A Meta-Attack Framework for Synthesizing Effective and Stealthy Adversarial Examples
PartDiffuser: Part-wise 3D Mesh Generation via Discrete Diffusion
Reconstructing Functional 3D Scenes from Egocentric Interaction Videos
Learning 3D Shape Fidelity Metric from Real-world Distortions
Toward Generalizable Whole Brain Representations with High-Resolution Light-Sheet Data
tttLRM: Test-Time Training for Long Context and Autoregressive 3D Reconstruction
Multimodal Learning on Low-Quality Data with Conformal Predictive Self-Calibration
Efficient Frame Selection for Long Video Understanding via Reinforcement Learning
RMIR: A Benchmark Dataset for Reasoning-Intensive Multimodal Image Retrieval
Head-wise Adaptive Rotary Positional Encoding for Fine-Grained Image Generation
ClimaOoD: Improving Anomaly Segmentation via Physically Realistic Synthetic Data
KVSmooth: Mitigating Hallucination in Multi-modal Large Language Models through Key-Value Smoothing
A Polarized Reflection and Material Dataset of Real World Objects
SIF: Semantically In-Distribution Fingerprints for Large Vision-Language Models
Black-box Membership Inference Attacks on the Pre-training Data of Image-generation Models
Revisiting Model Stitching In the Foundation Model Era
MotionEdit: Benchmarking and Learning Motion-Centric Image Editing
Emergent Extreme-View Geometry in 3D Foundation Models
Bidirectional Cross-Modal Prompting for Event-Frame Asymmetric Stereo
PAVAS: Physics-Aware Video-to-Audio Synthesis
VibeToken: Scaling 1D Image Tokenizers and Autoregressive Models for Dynamic Resolution Generations
CG-Reasoner: Centroid-Guided Positional Reasoning Segmentation for Medical Imaging with a Robust Visual-Text Consistency Metric
SpeeDe3DGS: Speedy Deformable 3D Gaussian Splatting with Temporal Pruning and Motion Grouping
Thermal Diffusion Matters: Infrared Spatial-Temporal Video Super-Resolution through Heat Conduction Priors
Multi-Metric Representation Learning Strategy Based on Clustering for Fine-Grained Multimodal Sentiment Analysis
Den-TP: Density-Balanced Data Curation and Evaluation Framework for Trajectory Prediction
VLM-3R: Vision-Language Models Augmented with Instruction-Aligned 3D Reconstruction
Sky2Ground: A Benchmark for Site Modeling under Varying Altitude
Dual-Estimator: Decoupling Global and Local Semantic Shift for Drift Compensation in Class-Incremental Learning
SpatialStack: Layered Geometry-Semantic Fusion for 3D VLM Spatial Reasoning
Hierarchical Visual Relocalization with Nearest View Synthesis from Feature Gaussian Splatting
Towards an Incremental Unified Multimodal Anomaly Detection: Augmenting Multimodal Denoising From an Information Bottleneck Perspective
Fusion of Depth and Semantic for Probabilistic Floorplan Localization
LDP-Slicing: Local Differential Privacy for Images via Randomized Bit-Plane Slicing
From Inpainting to Layer Decomposition: Repurposing Generative Inpainting Models for Image Layer Decomposition
Interpretable Prompts made Edit-Friendly: Token-to-Token Similarity Reduction in dLLMs for Edit-Friendly Hard Prompt Inversion
Representing 3D Faces with Learnable B-Spline Volumes
Hist2Style: Histogram-Guided Stylization with Bilateral Grids
Breaking Spurious Correlations: Uncertainty-Driven Causal Transformers for AU Detection
AVATAR: Reinforcement Learning to See, Hear, and Reason Over Video
Region-Aware Instance Consistency Learning for Micro-Expression Recognition
TimeViper: A Hybrid Mamba-Transformer Vision-Language Model for Efficient Long Video Understanding
Mind the Hitch: Dynamic Calibration and Articulated Perception for Autonomous Trucks
Stable and Efficient Single-Rollout RL for Multimodal Reasoning
TriSim: Tri-Dimensional Similarity Modeling with Extreme Value Theory for False-Negative Mitigation in Remote Sensing Image-Text Retrieval
AVFakeBench: A Comprehensive Audio-Video Forgery Detection Benchmark for AV-LMMs
Graph2Eval: Automatic Multimodal Task Generation for Agents via Knowledge Graphs
Mobile-VTON: High-Fidelity On-Device Virtual Try-On
MapReduce LoRA: Advancing the Pareto Front in Multi-Preference Optimization for Generative Models
DuetGen: Towards General Purpose Interleaved Multimodal Generation
LayoutAD: Exploring Semantic-Geometric Misalignment Reasoning for Scene Layout Anomaly Detection
HAD: Hallucination-Aware Diffusion Priors for 3D Reconstruction
Spatial-Spectral Residuals Informed Diffusion Neural Operator for Pan-sharpening
Native and Compact Structured Latents for 3D Generation
Direct Segmentation without Logits Optimization for Training-Free Open-Vocabulary Semantic Segmentation
Fresco: Frequency–Spatial Consistent Optimization for Fine-Grained Head Avatar Modeling
DINO Eats CLIP: Adapting Beyond Knowns for Open-set 3D Object Retrieval
CogDriver: Integrating Cognitive Inertia for Temporally Coherent Planning in Autonomous Driving
Boosting Visual Reprogramming for Vision-Language Models with Dual Granularity Alignment
FUSAR-GPT: A Spatiotemporal Feature-Embedded and Two-Stage Decoupled Visual Language Model for SAR Imagery
PhysGM: Large Physical Gaussian Model for Feed-Forward 4D Synthesis
DRAMA: Next-Gen Dynamic Orchestration for Resilient Multi-Agent Ecosystems in Flux
ContourVertex: Bridging Semantics and Geometry for Referring Remote Sensing Interpretation
InterPhys: Physics-aware Human Motion Synthesis in a Dynamic Scene
ReCALL: Recalibrating Capability Degradation for MLLM-based Composed Image Retrieval
AGENTSAFE: Benchmarking the Safety of Embodied Agents on Hazardous Instructions
DriveMoE: Mixture-of-Experts for Vision-Language-Action Model in End-to-End Autonomous Driving
DirectFisheye-GS: Enabling Native Fisheye Input in Gaussian Splatting with Cross-View Joint Optimization
Evidential Deep Partial Label Learning to Quantify Disambiguation Uncertainty
IMS3: Breaking Distributional Aggregation in Diffusion-Based Dataset Distillation
No Way To Steal My Face: Proactive Defense Against Identity-Preserving Personalized Generation
Beyond Text Prompts: Precise Concept Erasure through Text–Image Collaboration
Proxy-Tuning: Tailoring Multimodal Autoregressive Models for Subject-Driven Image Generation
GeoMotion: Rethinking Motion Segmentation via Latent 4D Geometry
LoFA: Learning to Predict Personalized Prior for Fast Adaptation of Visual Generative Models
OrionEdit: Bridging Reference and Source Images for Generalized Cross-Image Editing
Do Less, Achieve More: Do We Need Every-Step Optimization for RL Fine-tuning of Diffusion Models?
A Sanity Check for Multi-In-Domain Face Forgery Detection in the Real World
NeuroRule: Bridging Vision and Logic with Differentiable Rule Induction
Cross-Subject EEG-to-Video Reconstruction and Beyond
GEM-TFL: Bridging Weak and Full Supervision for Forgery Localization through EM-Guided Decomposition and Temporal Refinement
MixerCSeg: An Efficient Mixer Architecture for Crack Segmentation via Decoupled Mamba Attention
Adapter Shield: A Unified Framework with Built-in Authentication for Preventing Unauthorized Zero-Shot Image-to-Image Generation
VGGDrive: Empowering Vision-Language Models with Cross-View Geometric Grounding for Autonomous Driving
Quant Experts: Token-aware Adaptive Error Reconstruction with Mixture of Experts for Large Vision-Language Models Quantization
CineBrain: A Large-Scale Multi-Modal Brain Dataset During Naturalistic Audiovisual Narrative Processing
Annotation-Efficient Coreset Selection for Context-dependent Segmentation
RoboWheel: A Data Engine from Real-World Human Demonstrations for Cross-Embodiment Robotic Learning
CryoKRAQEN: Kernel-Regularized Annealing for Quantized Embedding Networks in Cryo-EM Heterogeneous Reconstruction
Towards Realistic and Consistent Orbital Video Generation via 3D Foundation Priors
WikiCLIP: An Efficient Contrastive Baseline for Open-domain Visual Entity Recognition
AffordGrasp: Cross-Modal Diffusion for Affordance-Aware Grasp Synthesis
Pano360: Perspective to Panoramic Vision with Geometric Consistency
ORSATR-X: A Foundation Model based on Differential-and-Excitation Networks for Optical Remote Sensing Object Recognition
PaCo-RL: Advancing Reinforcement Learning for Consistent Image Generation with Pairwise Reward Modeling
Beyond Layer-Wise Merging: Chain-of-Merging for Vision-Language Models
StaMo: Unsupervised Learning of Generalizable Robot Motion from Compact State Representation
BAMI: Training-Free Bias Mitigation in GUI Grounding
SVAgent: Storyline-guided Long Video Understanding via Cross-modal Multi-agent Collaboration
EventDrive: Event Cameras for Vision–Language Driving Intelligence
ReMoE: Region-Mixture Experts for Adversarially-Robust Vision Transformers
ConsisVLA-4D: Advancing Spatiotemporal Consistency in Efficient 3D-Perception and 4D-Reasoning for Robotic Manipulation
Stepwise Credit Assignment for GRPO on Flow-Matching Models
The LLM Bottleneck: Why Open-Source Vision LLMs Struggle with Hierarchical Visual Recognition
Common Inpainted Objects In-N-Out of Context
LoG3D: Ultra-High-Resolution 3D Shape Modeling via Local-to-Global Partitioning
Enhancing Visual Representation with Textual Semantics: Textual Semantics-Powered Prototypes for Heterogeneous Federated Learning
SPOT: Spatiotemporal Prompt Optimization for Motion-Stabilized MLLM-Guided Video Segmentation
Reconstructing Spiking Neural Networks Using a Single Neuron with Autapses
PhaSR: Generalized Image Shadow Removal with Physically Aligned Priors
ReflexSplit: Single Image Reflection Separation via Layer Fusion-Separation
Real2Sim2Real: RetinalDepth-64K for Depth Estimation in Posterior Segment Ophthalmic Surgery
A Unified Perspective on Adversarial Membership Manipulation in Vision Models
InvCoSS: Inversion-driven Continual Self-supervised Learning in Medical Multi-modal Image Pre-training
HiconAgent: History Context-aware Policy Optimization for GUI Agents
Shedding Light on VLN Robustness: A Black-box Framework for Indoor Lighting-based Adversarial Attack
FedAFD: Multimodal Federated Learning via Adversarial Fusion and Distillation
DPAR: Dynamic Patchification for Efficient Autoregressive Visual Generation
EMAD: Evidence-Centric Grounded Multimodal Diagnosis for Alzheimer’s Disease
Complet4R: Geometric Complete 4D Reconstruction
IVAAN: Instance-level Vision-Language Alignment via Attribute-Guided Text Prompts Generation for Nuclei Analysis
MagicFuse: Single Image Fusion for Visual and Semantic Reinforcement
D$^2$-FOSA: Dual-Diffusion Guided EEG-to-Image Reconstruction with Frequency-Oriented Semantic Alignment
Envision, Attend, Then Respond: Counterfactual Hallucination Mitigation in Large Vision-Language Models
Intra-class Distribution-guided Generative Hashing with Neighbor Refinement for Cross-modal Retrieval
Multimodal Causality-Driven Representation Learning for Generalizable Medical Image Segmentation
Attack for Defense: Adversarial Agents for Point Prompt Optimization Empowering Segment Anything Model
EEGiT: Teaching Vision Transformers to Understand the EEG signal
IGen: Scalable Data Generation for Robot Learning from Open-World Images
Spatial-SAM: Spatially Consistent 3D Electron Microscopy Segmentation with SDF Memory and Semi-Supervised Learning
Correspondence-Attention Alignment for Multi-view Diffusion Models
FeatureFool: Zero-Query Fooling of Video Models via Feature Map
Open-Ended Instruction Realization with LLM-Enabled Multi-Planner Scheduling in Autonomous Vehicles
SMAP: Semantic Route Planning with Map-Grounded Multimodal Alignment
Towards Dynamic Modality Alignment in Multimodal Continual Learning
FedBPrompt: Federated Domain Generalization Person Re-Identification via Body Distribution Aware Visual Prompts
Scalable Feature Matching via State Space Modeling and Sparse Correlation
Grounded Chain-of-Thought for Multimodal Large Language Models
Hyperbolic Relational Prompts for Intersectional Fairness in Medical VLMs
Tavatar: Topology-Aware Gaussian Attribute Derivation for Animatable Human Avatars
RevINN: An End-to-End Invertible Neural Network for Reversible Adversarial Examples Generation
CaReFlow: Cyclic Adaptive Rectified Flow for Multimodal Fusion
Breaking the Continuum: Discrete Distribution Learning for Structural MRI Reconstruction
Degradation-Robust Fusion: An Efficient Degradation-Aware Diffusion Framework for Multimodal Image Fusion in Arbitrary Degradation Scenarios
Focus, Don’t Prune: Identifying Instruction-Relevant Regions for Information-Rich Image Understanding
Omni-3DEdit: Generalized Versatile 3D Editing in One-Pass
LocateAnything3D: Vision-Language 3D Detection with Chain-of-Sight
SR3R: Rethinking Super-Resolution 3D Reconstruction With Feed-Forward Gaussian Splatting
Progressive Guessing to Fixed Point: Rethinking Human Motion Prediction with Deep Equilibrium Models
Distilling Unsigned Distance Function for Surface Reconstruction from 3D Gaussian Splatting
QVGGT: Post-Training Quantized Visual Geometry Grounded Transformer
FedDAP: Domain-Aware Prototype Learning for Federated Learning under Domain Shift
Physically-Grounded Turbulence Mitigation with Frame-Shared Degradation Parameters
Beyond Sequential Tools: A Unified VLM Agent System for Photographic Post-Processing via Dynamic Multi-Expert Fusion
MOSAIC3D: Modular Scene Assembly for Real-Time 3D Segment Anything
Edge-Focused Super-Resolution for Omnidirectional Images with Spherical Geometric Augmentation
Mind the Way You Select Negative Texts: Pursuing the Distance Consistency in OOD Detection with VLMs
UCAN: Unified Convolutional Attention Network for Expansive Receptive Fields in Lightweight Super-Resolution
DriveCTR: Benchmarking Compositional Traffic Rule Reasoning in Autonomous Driving
Bi-Bridge: Bidirectional Diffusion Bridges for Low-Light Image Enhancement
Unsupervised Monocular 3D Keypoint Discovery from Multi-View Diffusion Priors
The Universal Normal Embedding
Cross-Scale Pansharpening via ScaleFormer and the PanScale Benchmark
Multigrain-aware Semantic Prototype Scanning and Tri-token Prompt Learning embraced High-order RWKV for Pan-sharpening
SenseSearch: Empowering Vision-Language Models with High-Resolution Agentic Search-Reasoning via Reinforcement Learning
Image-Guided Geometric Stylization of 3D Meshes
Zero-Shot Reconstruction of Animatable 3D Avatars with Cloth Dynamics from a Single Image
Linking Perception, Confidence and Accuracy in MLLMs
Tracking through Severe Occlusion via Event-Derived Transient Cues
MPL: Match-guided Prototype Learning for Few-shot Action Recognition
BioVITA: Biological Dataset, Model, and Benchmark for Visual-Textual-Acoustic Alignment
FastGaMer: Efficient GainMap Learning for Practical Inverse Tone Mapping
Adaptive Spatial-Temporal Window: Unlocking the Potential of Event Cameras in Heterogeneous Velocity Scenarios
DART: Dynamic ModAlity-balanced Multimodal RepresenTation Learning for E-commerce Product Understanding
AffordMatcher: Affordance Learning in 3D Scenes from Visual Signifiers
Learning to See and Act: Task-Aware Virtual View Exploration for Robotic Manipulation
Exposing and Evaluating Hallucinations for GUI Grounding
BiPreManip: Learning Affordance-Based Bimanual Pre-Manipulation through Anticipatory Collaboration
Adaptive Sparsity for Efficient Long-Video Understanding
BarbieGait: An Identity-Consistent Synthetic Human Dataset with Versatile Cloth-Changing for Gait Recognition
ROSE: Rotate Your Large Language Model to See
LiDAR-to-4DRadar Diffusion Bridge via Cross-Modal Alignment and Translation in Latent Space
PRISM: Video Dataset Condensation with Progressive Refinement and Insertion for Sparse Motion
Hi-Lo Prune: Look at What You'll Lose before Pruning with Hierarchical Token Selection
SANER: Switchable Adapter with Non-parametric Enhanced Routing for Person De-Reidentification
Open-world Hand-Object Interaction Video Generation Based on Structure and Contact-aware Representation
EVObject: Learning Evolving Object-centric Representations for 3D Instance Segmentation without Scene Supervision
SDTrack: A Baseline for Event-based Tracking via Spiking Neural Networks
LookasideVLN: Direction-Aware Aerial Vision-and-Language Navigation
Learning Transferable Temporal Primitives for Video Reasoning via Synthetic Videos
Long-Term Personalized Multimodal LLMs
FantasyVLN: Unified Multimodal Chain-of-Thought Reasoning for Vision-and-Language Navigation
Making Training-Free Diffusion Segmentors Scale with the Generative Power
Neurodynamics-Driven Coupled Neural P Systems for Multi-Focus Image Fusion
Quantum-Gated Task-interaction Knowledge Distillation for Pre-trained Model-based Class-Incremental Learning
Semantic Noise Reduction via Teacher-Guided Dual-Path Audio-Visual Representation Learning
AR²-4FV: Anchored Referring and Re-identification for Long-Term Grounding in Fixed-View Videos
XSeg: A Large-scale X-ray Contraband Segmentation Benchmark For Real-World Security Screening
Exposing Functional Fusion: A New Class of Strategic Backdoor in Dynamic Prompt Architectures
Skyra: AI-Generated Video Detection via Grounded Artifact Reasoning
Deformation-based In-Context Learning for Point Cloud Understanding
EvoID: Reinforced Evolution for Identity-Preserving Video Generation
HulluEdit: Single-Pass Evidence-Consistent Subspace Editing for Mitigating Hallucinations in Large Vision-Language Models
Streaming Video Crime Anticipation with Spatio-Temporal Causal Reasoning
Sparsely Timing the Change: A Spiking Temporal Framework for Remote Sensing Interpretation
A supervised multi-task framework for joint cryo-ET restoration enabled by generative physical simulation
SceneScribe-1M: A Large-Scale Video Dataset with Comprehensive Geometric and Semantic Annotations
MedKCO: Medical Vision-Language Pretraining via Knowledge-Driven Cognitive Orchestration
Echoes of Ownership: Adversarial-Guided Dual Injection for Copyright Protection in MLLMs
HyperST: Hierarchical Hyperbolic Learning for Spatial Transcriptomics Prediction
DRM: Diffusion-based Reward Model With Step-wise Guidance
Understanding Task Transfer in Vision-Language Models
Learning to See Through a Baby’s Eyes: Early Visual Diets Enable Robust Visual Intelligence in Humans and Machines
Reconstructing CLIP for Open-Vocabulary Dense Perception
MaskFocus: Focusing Policy Optimization on Critical Steps for Masked Image Generation
Beyond Perceptual Shortcuts: Causal-Inspired Debiasing Optimization for Generalizable Video Reasoning in Lightweight MLLMs
AnchorSplat: Feed-Forward 3D Gaussian Splatting With 3D Geometric Priors
View-Aware Semantic Alignment for Aerial-Ground Person Re-Identification
Consistency Beyond Contrast: Enhancing Open-Vocabulary Object Detection Robustness via Contextual Consistency Learning
Denoising as Path Planning: Training-Free Acceleration of Diffusion Models with DPCache
FlowPortal: Residual-Corrected Flow for Training-Free Video Relighting and Background Replacement
ProcessMaker: A Generalized Process Visualization Framework with Adaptive Sequence Steps on Diffusion Transformers
Learning Scene Coordinate Reconstruction from Unposed Images via Pose Graph Optimization
HumanBA: Human-Aware Bundle Adjustment via Global Human-Camera Decoupling
CRAFT: Aligning Diffusion Models with Fine-Tuning Is Easier Than You Think
STiTch: Semantic Transition and Transportation in Collaboration for Training-Free Zero-Shot Composed Image Retrieval
High-Quality and Efficient Turbulence Mitigation with Events
SwitchCraft: Training-Free Multi-Event Video Generation with Attention Controls
Exploring Spatial Intelligence from a Generative Perspective
Don’t Show Pixels, Show Cues: Unlocking Visual Tool Reasoning in Language Models via Perception Programs
SegCompass: Exploring Interpretable Alignment with Sparse Autoencoders for Enhanced Reasoning Segmentation
From Infusion to Assimilation Distillation for Medical Image Segmentation
GS-ASM: 2DGS-Supervised Active Stereo Matching
TRIDENT: A Trimodal Cascade Generative Framework for Drug and RNA-Conditioned Cellular Morphology Synthesis
RE-VLM: Event-Augmented Vision-Language Model for Scene Understanding
SurgCoT: Advancing Spatiotemporal Reasoning in Surgical Videos through a Chain-of-Thought Benchmark
X-PCR: A Benchmark for Cross-modality Progressive Clinical Reasoning in Ophthalmic Diagnosis
SO(3)-Equivariant ViT-Adapter for Data-Efficient Zero-Shot Sim-to-Real Indoor Panoramic Depth Estimation
NEC-Diff: Noise-Robust Event–RAW Complementary Diffusion for Seeing Motion in Extreme Darkness
MedGRPO: Multi-Task Reinforcement Learning for Heterogeneous Medical Video Understanding
Consistent Instance Field for Dynamic Scene Understanding
SAMIX: Reinforcing SAM2 with Semantic Adapter and Reference Selecting Policy for Mix-Supervised Segmentation
AviaSafe: A Physics-Informed Data-Driven Model for Aviation Safety–Critical Cloud Forecasts
When Numbers Speak: Aligning Textual Numerals and Visual Instances in Text-to-Video Diffusion Models
SoliReward: Mitigating Susceptibility to Reward Hacking and Annotation Noise in Video Generation Reward Models
Hunting Normality from Query Sample via Residual Learning for Generalist Anomaly Detection
CoLoGen: Progressive Learning of Concept–Localization Duality for Unified Image Generation
Guiding Diffusion-based Reconstruction with Contrastive Signals for Balanced Visual Representation
BlackMirror: Black-Box Backdoor Detection for Text-to-Image Models via Instruction-Response Deviation
Texvent: Asynchronous Event Data Simulation via Text Prompt
Spectral-Geometric Neural Fields for Pose-Free LiDAR View Synthesis
Think 360°: Evaluating the Width-centric Reasoning Capability of MLLMs Beyond Depth
PAS: A Training-Free Stabilizer for Temporal Encoding in Video LLMs
UPLiFT: Efficient Pixel-Dense Feature Upsampling with Local Attenders
Masked Representation Modeling for Domain-Adaptive Segmentation
TC-Padé: Trajectory-Consistent Padé Approximation for Diffusion Acceleration
Unifying Perception and Action: A Hybrid-Modality Pipeline with Implicit Visual Chain-of-Thought for Robotic Action Generation
See Through the Noise: Improving Domain Generalization in Gaze Estimation
Language-Guided One-Step Diffusion Model for Nighttime Flare Removal
Gen3R: 3D Scene Generation Meets Feed-Forward Reconstruction
Reading Your Actions: Learning Generalizable Action Representations via Pre-training AEMG
Knowing Thyself: Ego-Grounding for Personalized Question-Answering in Egocentric Videos
UST-Hand: An Uncertainty-aware Spatiotemporal Point Cloud Interaction Network for 3D Self-supervised Hand Pose Estimation
Video-Only ToM: Enhancing Theory of Mind in Multimodal Large Language Models
AHS: Adaptive Head Synthesis via Synthetic Data Augmentations
MultiAnimate: Pose-Guided Image Animation Made Extensible
TTL: Test-time Textual Learning for OOD Detection with Pretrained Vision-Language Models
FedCART: Tackling Long-Tailed Distributions in Federated Adversarial Training via Classifier Refinement
Rethinking Cross-Modal Anchor Alignment for Mitigating Error Accumulation
HyperGait: Unleashing the Power of Parsing for Gait Recognition in the Wild via Hypergraph
Decouple to Generalize: Context-First Self-Evolving Learning for Data-Scarce Vision-Language Reasoning
Geoint-R1: Formalizing Multimodal Geometric Reasoning with Dynamic Auxiliary Constructions
GGBench: A Geometric Generative Reasoning Benchmark for Unified Multimodal Models
RxnCaption: Reformulating Reaction Diagram Parsing as Visual Prompt Guided Captioning
FedAdamom: Adaptive Momentum for Improved Generalization in Federated Optimization
E3AD: An Emotion-Aware Vision-Language-Action Model for Human-Centric End-to-End Autonomous Driving
IrisFP: Adversarial-Example-based Model Fingerprinting with Enhanced Uniqueness and Robustness
SAQN: Semantic-based Adaptive Query Network for 3D Referring Expression Segmentation
UAVLight: A Benchmark for Illumination-Robust 3D Reconstruction in Unmanned Aerial Vehicle (UAV) Scenes
TTP: Test-Time Padding for Adversarial Detection and Robust Adaptation on Vision-Language Models
FlowPalm: Optical Flow Driven Non-Rigid Deformation for Geometrically Diverse Palmprint Generation
Submodel Extraction for Efficient and Personalized Federated Learning via Optimal Transport
MERLIN: Building Low-SNR Robust Multimodal LLMs for Electromagnetic Signals
CompBench: Benchmarking Complex Instruction-guided Image Editing
Unleashing VLA Potentials in Autonomous Driving via Explicit Learning from Failures
Thinking Beyond Labels: Vocabulary-Free Fine-Grained Recognition using Reasoning-Augmented LMMs
VisualAD: Language-Free Zero-Shot Anomaly Detection via Vision Transformer
Rethinking Asymmetric Quantization: Hidden Symmetry in Vision Model Weights
Interactive Tracking: A Human-in-the-Loop Paradigm with Memory-Augmented Adaptation
Steering Where to Diffuse: Generative Modeling of Phenotypic Response Simulation with Steered Diffusion Bridge
Coordinate Denoising for Non-Equilibrium Molecular Representation Learning
TrajRAG: Retrieving Geometric-Semantic Experience for Zero-Shot Object Navigation
Multi-Scale Gaussian-Language Map for Embodied Navigation and Reasoning
See and Fix the Flaws: Enabling VLMs and Diffusion Models to Comprehend Visual Artifacts via Agentic Data Synthesis
Machine Unlearning via Adaptive Gradient Reweighting and Multi-stage Objective Optimization
A More Word-like Image Tokenization for MLLMs
Fixed Anchors Are Not Enough: Dynamic Retrieval and Persistent Homology for Dataset Distillation
Beyond Static Frames: Temporal Aggregate-and-Restore Vision Transformer for Human Pose Estimation
AD-GBC: Anisotropic Granular-Ball Skip-Connection Refiner for UNet-Based Medical Image Segmentation
Temporal Interaction in Spiking Transformers with Multi-Delay Mixer
Sparse Task Vector Mixup with Hypernetworks for Efficient Knowledge Transfer in Whole-Slide Image Prognosis
SCIEval: Evaluating and Benchmarking the Faithfulness of Scientific Image Generation and Interpretation with Large Multimodal Models
Topology-aware Feature Propagation for Unsupervised Non-rigid Point Cloud Correspondence
Personalized Longitudinal Medical Report Generation via Temporally-Aware Federated Adaptation
Learning Latent Transmission and Glare Maps for Lens Veiling Glare Removal
Universal Computational Aberration Correction: A Comprehensive Benchmark Analysis
MoRel: Long-Range Flicker-Free 4D Motion Modeling via Anchor Relay-based Bidirectional Blending with Hierarchical Densification
GraphVLM: Benchmarking Vision Language Models for Multimodal Graph Learning
Bootstrap Dynamic-Aware 3D Visual Representation for Scalable Robot Learning
CustomTex: High-fidelity Indoor Scene Texturing via Multi-Reference Customization
BabyVLM v2: Toward Developmentally Grounded Vision–Language Models with Real Infant-View Data and Cognitive Evaluation Benchmarks
Instruction-Guided Lesion Segmentation for Chest X-rays with Automatically Generated Large-Scale Dataset
HiFi-Brep: High-Fidelity B-Rep Latent Representation and Robust Generation
Prototype-as-Prompt: Multimodal Sentiment Prototypes Endowing Large Language Models the Capability to Perform Multimodal Sentiment Analysis
Life-IQA: Boosting Blind Image Quality Assessment through GCN-enhanced Layer Interaction and MoE-based Feature Decoupling
DiG: Differential Grounding for Enhancing Fine-Grained Perception in Multimodal Large Language Models
TableMix: Enhancing Multimodal Table Reasoning in MLLMs from a Data-Centric Perspective
FlowMotion: Training-Free Flow Guidance for Video Motion Transfer
Dropping Anchor and Spherical Harmonics for Gaussian Splatting
Parameterized Prompt for Incremental Object Detection
LiteSense: Lifting Lightweight ToF with RGB for High-Resolution Metric Depth Estimation
Anchoring the Mind of Multimodal Reasoners: Cognitive Bias as a Vector for Jailbreak Attacks
Improving Adversarial Transferability with Local Perturbation Augmentation
Less is More: Data-Efficient Adaptation for Controllable Text-to-Video Generation
Local Motion Matters: A Deconstruct–Recompose Paradigm for Reinforcement Learning Pre-training from Videos
CodeMMR: Bridging Natural Language, Code, and Image for Unified Retrieval
PECCVAI: Overcoming the Brittleness of AI Image Watermarking Under Visual Paraphrasing Attacks
NOVA: Sparse Control, Dense Synthesis for Pair-Free Video Editing
RewardFlow: Generate Images by Optimizing What You Reward
PyraTok: Language-Aligned Pyramidal Tokenizer for Video Understanding and Generation
Dynamic Visual SLAM using a General 3D Prior
FreqEdit: Preserving High-Frequency Features for Robust Multi-Turn Image Editing
From 3D Pose to Prose: Biomechanics-Grounded Vision–Language Coaching
Chain of Event-Centric Causal Thought for Physically Plausible Video Generation
Optical Flow Matching: Reframing Optical Flow as Continuous Transport Dynamics
From Exploration to Exploitation: A Two-Stage Entropy RLVR Approach for Noise-Tolerant MLLM Training
UETrack: A Unified and Efficient Framework for Single Object Tracking
Omni IIE Bench: Benchmarking the Practical Capabilities of Image Editing Models
High Resolution Neural Video Coding with Bi-directional Confidence-Guided Reference Information Modeling
Discovering Adaptive Task Dependencies for Efficient Multi-Task Representation Compression
Understanding Temporal Logic Consistency in Video-Language Models through Cross-Modal Attention Discriminability
Selective, Regularized, and Calibrated: Harnessing Vision Foundation Models for Cross-Domain Few-Shot Semantic Segmentation
Understanding and Enforcing Weight Disentanglement in Task Arithmetic
Gaze Target Estimation with Concepts
SparseCam4D: Spatio-Temporally Consistent 4D Reconstruction from Sparse Cameras
Egocentric Visibility-Aware Human Pose Estimation
Neural Collapse in Test-Time Adaptation
Hierarchical Codec Diffusion for Video-to-Speech Generation
RoadSceneBench: A Lightweight Benchmark for Mid-Level Road Scene Understanding
Improving Sparse Autoencoder with Dynamic Attention
Medical Video Diagnosis via Counterfactual Reasoning
An Instance-Centric Panoptic Occupancy Prediction Benchmark for Autonomous Driving
The Midas Touch for Metric Depth
Virtual Full-stack Scanning of Brain MRI via Imputing Any Quantised Code
Bringing Your Portrait to 3D Presence
Mirai: Autoregressive Visual Generation Needs Foresight
DreamShot: Personalized Storyboard Synthesis with Video Diffusion Prior
DreamOmni2: Multimodal Instruction-based Generation and Editing
Enhancing the Security of Visual Speaker Authentication Based on Dynamic Lip-Print Analysis
Spatial-SSRL: Enhancing Spatial Understanding via Self-Supervised Reinforcement Learning
PAD-Hand: Physics-Aware Diffusion for Hand Motion Recovery
Unified Spherical Frontend: Learning Rotation-Equivariant Representations of Any Camera
Semantics Lead the Way: Harmonizing Semantic and Texture Modeling with Asynchronous Latent Diffusion
Gravitation-Driven Semantic Alignment for Text Video Retrieval
TagSplat: Topology-Aware Gaussian Splatting for Dynamic Mesh Modeling and Tracking
PosterIQ: A Design Perspective Benchmark for Poster Understanding and Generation
CoCoVideo: The High-Quality Commercial-Model-Based Contrastive Benchmark for AI-Generated Video Detection
PAUL: Uncertainty-Guided Partition and Augmentation for Robust Cross-View Geo-Localization under Noisy Correspondence
Hierarchical Process Reward Models are Symbolic Vision Learners
ADSeeker: A Knowledge-Grounded Reasoning Framework for Industry Anomaly Detection and Reasoning
Adapting Lightweight Image-based Counting Models for Video Crowd Counting
RoSAMDepth: Robust Self-supervised Depth Estimation Leveraging Segment Anything Model
Portable Active Learning for Object Detection
CraftMesh: High-Fidelity Generative Mesh Manipulation via Poisson Seamless Fusion
Residual Decoding: Mitigating Hallucinations in Large Vision-Language Models via History-Aware Residual Guidance
PP-OCRv5: A Specialized 5M-Parameter Model Rivaling Billion-Parameter Vision-Language Models on OCR Tasks
Boosting Document Parsing Efficiency and Performance with Coarse-to-Fine Visual Processing
Global-Graph Guided and Local-Graph Weighted Contrastive Learning for Unified Clustering on Incomplete and Noise Multi-View Data
MoCoDiff: A Controllable Autoregressive Diffusion Model for Expressive Motion Generation
APEX: A Decoupled Memory-based Explorer for Asynchronous Aerial Object Goal Navigation
RetimeGS: Continuous-Time Reconstruction of 4D Gaussian Splatting
Accelerating Diffusion Model Training under Minimal Budgets: A Condensation-Based Perspective
Seeing Depth Through Frequency and Motion: A Progressive Training Paradigm for Monocular Depth Estimation
VQRAE: Representation Quantization Autoencoders for Multimodal Understanding, Generation and Reconstruction
A Style is Worth One Code: Unlocking Code-to-Style Image Generation with Discrete Style Space
Generalizable Video Quality Assessment via Weak-to-Strong Learning
A³: Towards Advertising Aesthetic Assessment
Active Perceptual Inference: A Corticothalamic-Inspired Dynamic Nested Recurrent Network for Multimodal Sentiment Analysis with Incomplete Data
Parallax to Align Them All: An OmniParallax Attention Mechanism for Distributed Multi-View Image Compression
Beyond the Static World: Continual Category Discovery under Visual Drift
Yo'City: Personalized and Boundless 3D Realistic City Scene Generation via Self-Critic Expansion
MUSE: Harnessing Precise and Diverse Semantics for Few-Shot Whole Slide Image Classification
Cross-domain Dual-stream Feature Disentanglement for Brain Disorder Prediction with Sparsely Labeled PET
Fusion in Your Way: Aligning Image Fusion with Heterogeneous Demands via Direct Preference Optimization
Learning Mutual View Information Graph for Adaptive Adversarial Collaborative Perception
RecoverMark: Robust Watermarking for Localization and Recovery of Manipulated Faces
ProSoftArena: Evaluating Hierarchical Capabilities of Multimodal Agents in Professional Software Environments
When Lines Meet Textures: Spatial-Frequency Aligned Diffusion Features for Cross-Sparsity Correspondence
ESAM++: Efficient Online 3D Perception on the Edge
Better, Stronger, Faster: Tackling the Trilemma in MLLM-based Segmentation with Simultaneous Textual Mask Prediction
Generalizable Co-Salient Object Detection via Mixed Content-Style Modulation
UNICBench: UNIfied Counting Benchmark for MLLM
Generalizable Structure-Aware Keypoint Correspondence for Category-Unified 3D Single Object Tracking
Partial Weakly-Supervised Oriented Object Detection
From Static to Dynamic: Exploring Self-supervised Image-to-Video Representation Transfer Learning
PGA: Prior-free Generative Attack for Practical No-box Scenario
dMLLM-TTS: Self-Verified and Efficient Test-Time Scaling for Diffusion Multi-Modal Large Language Models
ArtiMuse: Fine-Grained Image Aesthetics Assessment with Joint Scoring and Expert-Level Understanding
Pushing the Frontier of Audiovisual Perception with Large-Scale Multimodal Correspondence Learning
SEA-Flow3D: Simplified, Efficient, and Accurate Scene Flow via Spatial Vector Sampling and Multi-scale Refinement
CASPA: Graph-Structured Concept Anchors for Modality-Agnostic Adaptation in Vision–Language Models
Property-Informed Diffusion-Based Text-to-Microstructure Generation
Beyond Text: Visual Description Assembly by Probabilistic Model for CLIP-based Weakly Supervised Semantic Segmentation
Cross-Slice Knowledge Transfer via Masked Multi-Modal Heterogeneous Graph Contrastive Learning for Spatial Gene Expression Inference
Adapting a Pre-trained Single-Cell Foundation Model to Spatial Gene Expression Generation from Histology Images
Frequency-Aware Affinity for Weakly Supervised Semantic Segmentation
CoopDiff: A Diffusion-Guided Approach for Cooperation under Corruptions
Thinking in Uncertainty: Mitigating Hallucinations in MLRMs with Latent Entropy-Aware Decoding
CATNet: Collaborative Alignment and Transformation Network for Cooperative Perception
EmoDiffTalk: Emotion-aware Diffusion for Editable 3D Gaussian Talking Head
Improving Vision-language Models with Perception-centric Process Reward Models
GThinker: Towards General Multimodal Reasoning via Cue-Guided Rethinking
Synthetic Knowledge-Guided Learning via Target-Region Gradients
Survive the 1001$^{st}$ Night: Interactive Physical Reasoning
An Optimal Transport-driven Approach for Cultivating Latent Space in Online Incremental Learning
UniRefiner: Teaching Pre-trained ViTs to Self-Dispose Dross via Contrastive Register
Visual Sim-to-Real at Scale for Humanoid Loco-Manipulation
Layer Consistency Matters: Elegant Latent Transition Discrepancy for Generalizable Synthetic Image Detection
SciEducator: Scientific Video Understanding and Educating via Deming-Cycle Multi-Agent System
$\oslash$ Source Models Leak What They Shouldn’t $\nrightarrow$ : Unlearning Zero-Shot Transfer in Domain Adaptation Through Adversarial Optimization
GeoDexGrasp: Geometry-aware Generation for Data-efficient and Physics-plausible Dexterous Grasping
Towards Real-World Document Parsing via Realistic Scene Synthesis and Document-Aware Training
TINA: Text-Free Inversion Attack for Unlearned Text-to-Image Diffusion Models
IBISAgent: Reinforcing Pixel-Level Visual Reasoning in MLLMs for Universal Biomedical Object Referring and Segmentation
ChangeBridge: Spatiotemporal Image Generation with Multimodal Controls for Remote Sensing
Bootstrap Your Own AV-Proxies: Adaptive Contrastive and Prototype Learning for Audio-Visual Segmentation
When Robots Obey the Patch: Universal Transferable Patch Attacks on Vision-Language-Action Models
ActivePolicy: Active Gaussian Reconstruction and Optimization Strategy Based on Global-Local Information Gain
GraspALL: Adaptive Structural Compensation from Luminance Variation for Robotic Garment Grasping in Any Low-Light Conditions
S$^2$-MLLM: Boosting Spatial Reasoning Capability of MLLMs for 3D Visual Grounding with Structural Guidance
IAG: Input-aware Backdoor Attack on VLM-based Visual Grounding
LiDAR Prompted Spatio-Temporal Multi-View Stereo for Autonomous Driving
Intrinsic Concept Extraction Based on Compositional Interpretability
SOTA: Self-adaptive Optimal Transport for Zero-Shot Classification with Multiple Foundation Models
ZeroIDIR: Zero-Reference Illumination Degradation Image Restoration with Perturbed Consistency Diffusion Models
Select, Hypothesize and Verify: Towards Verified Neuron Concept Interpretation
UAST: Unified Active Search and Tracking for Arbitrary Targets with UAVs
ColaVLA: Leveraging Cognitive Latent Reasoning for Hierarchical Parallel Trajectory Planning in Autonomous Driving
Eliciting Complex Spatial Reasoning in MLLMs through Wide-Baseline Matching
Towards Streaming Referring Video Segmentation via Large Language Model
Mocap-2-to-3: Multi-view Lifting for Monocular Motion Recovery with 2D Pretraining
ORV: 4D Occupancy-centric Robot Video Generation
FedMOP: Achieving Enhanced Privacy and Performance in Federated Learning via Momentum Orthogonal Projection
All in One: Unifying Deepfake Detection, Tampering Localization, and Source Tracing with a Robust Landmark-Identity Watermark
RAISE: Requirement-Adaptive Evolutionary Refinement for Training-Free Text-to-Image Alignment
ApET: Approximation-Error Guided Token Compression for Efficient VLMs
LifeEval: A Multimodal Benchmark for Assistive AI in Egocentric Daily Life Tasks
Scaling4D: Pushing the Frontier of Video Novel View Synthesis through Large-Scale Monocular Videos
PureCC: Pure Learning for Text-to-Image Concept Customization
Exemplar-Free Class Incremental Learning via Preserving Class-Discriminative Structure
Cluster-Wise Spatio-Temporal Masking for Efficient Video-Language Pretraining
DMAligner: Enhancing Image Alignment via Diffusion Model Based View Synthesis
RaPA: Enhancing Transferable Targeted Attacks via Random Parameter Pruning
Disentangle-then-Align: Non-Iterative Hybrid Multimodal Image Registration via Cross-Scale Feature Disentanglement
Active Intelligence in Video Avatars via Closed-loop World Modeling
Personalized Federated Training of Diffusion Models with Privacy Guarantees
Disentanglement-wise Image Dehazing through Cross-Domain Manifold Consensus
VGGT-Det: Mining VGGT Internal Priors for Sensor-Geometry-Free Multi-view Indoor 3D Object Detection
COT-FM: Cluster-wise Optimal Transport Flow Matching
Routing on Demand: DSNet for Efficient Progressive Point Cloud Denoising
Unified Spatiotemporal Token Compression for Video-LLMs at Ultra-Low Retention
Rethinking Diffusion Model-Based Video Super-Resolution: Leveraging Dense Guidance from Aligned Features
ViT$^3$: Unlocking Test-Time Training in Vision
Learning to Identify Out-of-Distribution Objects for 3D LiDAR Anomaly Segmentation
VLM4RSDet: Collaborative Optimization with Vision-Language Model for Enhancing Remote Sensing Object Detection
BALM: A Model-Agnostic Framework for Balanced Multimodal Learning under Imbalanced Missing Rates
GMT: Effective Global Framework for Multi-Target Multi-Camera Tracking
STUR3D: Spatio-Temporal Unified Representation Learning for 3D Object Detection
Beyond the Golden Data: Resolving the Motion-Vision Quality Dilemma via Timestep Selective Training
VideoSSR: Video Self-Supervised Reinforcement Learning
AutoRegressive Generation with B-rep Holistic Token Sequence Representation
BrepVGAE: Variational Graph Autoencoder with Unified Latent Representation for B-rep
Distribution-Aligned Multimodal Fusion for Robust Object Detection
Universal 3D Shape Matching via Coarse-to-Fine Language Guidance
Hybrid Robust Collaborative Perception with LiDAR-4D Radar Fusion under Adverse Weather Conditions
Occlusion-Aware SORT: Observing Occlusion for Robust Multi-Object Tracking
Mask to Align, Weight to Disambiguate: Reliable Unsupervised Cross-Modal Hashing with Masked-Weight Contrast
Learning from Noisy Supervision: A Denoising–Debiasing Framework for Weakly Supervised Video Anomaly Detection
Tackling Model Bias via Game-theoretic Multi-agent Collaboration Framework for Hateful Meme Classification
Scaling Spatial Intelligence with Multimodal Foundation Models
Prompt Yourself: Awakening Textual Semantics in 1D Visual Tokenizers
Next-Scale Autoregressive Models for Text-to-Motion Generation
ReLaX: Reasoning with Latent Exploration for Large Reasoning Models
Exploring the Underwater World Segmentation without Extra Training
Dance Across Shifts: Forward-Facilitation Continual Test-Time Adaptation through Dynamic Style Bridging
MSRL: Scaling Generative Multimodal Reward Modeling via Multi-Stage Reinforcement Learning
PFGNet: A Fully Convolutional Frequency-Guided Peripheral Gating Network for Efficient Spatiotemporal Predictive Learning
Thinking Diffusion: Penalize and Guide Visual-Grounded Reasoning in Diffusion Multimodal Language Models
TextPecker: Rewarding Structural Anomaly Quantification for Enhancing Visual Text Rendering
UniVBench: Towards Unified Evaluation for Video Foundation Models
Refracting Reality: Generating Images with Realistic Transparent Objects
MedTVT-R1: A Multimodal LLM Empowering Medical Reasoning and Diagnosis
Recover to Predict: Progressive Retrospective Learning for Variable-Length Trajectory Prediction
STRNet: Visual Navigation with Spatio-Temporal Representation through Dynamic Graph Aggregation
DiT360: High-Fidelity Panoramic Image Generation via Hybrid Training
AirSim360: A Panoramic Simulation Platform within Drone View
OmniZip: Learning a Unified and Lightweight Lossless Compressor for Multi-Modal Data
IncreFA: Breaking the Static Wall of Generative Model Attribution
GaussianVision: Vision-Language Alignment from Compressed Image Representations using 2D Gaussian Splatting
More than the Sum: Panorama-Language Models for Adverse Omni-Scenes
IntroSVG: Learning from Rendering Feedback for Text-to-SVG Generation via an Introspective Generator–Critic Framework
Refaçade: Editing Object with Given Reference Texture
EnergyAction: Unimanual to Bimanual Composition with Energy-Based Models
MARIS: Marine Open-Vocabulary Instance Segmentation
Boosting Quantitative and Spatial Awareness for Zero-Shot Object Counting
SEA: Evaluating Sketch Abstraction Efficiency via Element-level Common-sense Visual Question Answering
Self-Paced and Self-Corrective Masked Prediction for Movie Trailer Generation
SegMoTE: Token-Level Mixture of Experts for Medical Image Segmentation
Pointer-CAD: Unifying B-Rep and Command Sequences via Pointer-based Edges & Faces Selection
DuetMerging: Synergizing Dynamic and Static Strategies for Mitigating Task Interference in Model Merging
Rejection Mixing: Fast Semantic Propagation of Mask Tokens for Efficient DLLM Inference
From Remember to Transfer: Interpretable Open-World Reasoning in MLLMs
DisCa: Accelerating Video Diffusion Transformers with Distillation-Compatible Learnable Feature Caching
Modeling Cross-vision Synergy for Unified Large Vision Model
UARE: A Unified Vision-Language Model for Image Quality Assessment, Restoration, and Enhancement
Bayesian Decomposition and Semantic Completion for Few-shot Semantic Segmentation
FreeArtGS: Articulated Gaussian Splatting Under Free-moving Scenario
Vanast: Virtual Try-On with Human Image Animation via Synthetic Triplet Supervision
Scaling Instruction-Based Video Editing with a High-Quality Synthetic Dataset
HoloCine: Holistic Generation of Cinematic Multi-Shot Long Video Narratives
Sparsity-Aware Voxel Attention and Foreground Modulation for 3D Semantic Scene Completion
$\alpha$Matte4K & $\mu$Matting: Dataset and Model for Ultra-Micro Precision Alpha Video Matting
LATA: Laplacian-Assisted Transductive Adaptation for Conformal Uncertainty in Medical VLMs
Towards Efficient Medical Reasoning with Minimal Fine-Tuning Data
CHIPS: Efficient CLIP Adaptation via Curvature-aware Hybrid Influence-based Data Selection
DualReg: Dual-Space Filtering and Reinforcement for Rigid Registration
Seeing Beyond: Extrapolative Domain Adaptive Panoramic Segmentation
StreamAvatar: Streaming Diffusion Models for Real-Time Interactive Human Avatars
ActAvatar: Temporally-Aware Precise Action Control for Talking Avatars
Perceiving the Near, Reasoning the Distant: Coherent Long-Horizon Trajectory Prediction for Autonomous Driving
Taxonomy-Aware Representation Alignment for Hierarchical Visual Recognition with Large Multimodal Models
SVBench: Evaluation of Video Generation Models on Social Reasoning
Multi-Paradigm Collaborative Adversarial Attack Against Multimodal Large Language Models
Critical Patch-Aware Sparse Prompting with Decoupled Training for Continual Learning on the Edge
Beyond Myopic Alignment: Lookahead Optimization for Online Class-Incremental Learning
Towards Highly Transferable Vision-Language Attack via Semantic-Augmented Dynamic Contrastive Interaction
SPAN: Spatial-Projection Alignment for Monocular 3D Object Detection
BuildAnyPoint: 3D Building Structured Abstraction from Diverse Point Clouds
AdaCluster: Adaptive Query-Key Clustering for Sparse Attention in Video Generation
CompetitorFormer: Mitigating Query Conflicts for 3D Instance Segmentation via Competitive Strategy
Seeing What Matters: A Training-Free Self-Guided Framework for Multimodal Detail Perception and Reasoning
Temporal Equilibrium MeanFlow: Bridging the Scale Gap for One-Step Generation
Hint2Gen: Bridging Understanding and Generation via Code-structured Hints
Beyond Mimicry: Learning Whole-Body Human-Humanoid Interaction from Human-Human Demonstrations
CUBic: Coordinated Unified Bimanual Perception and Control Framework
S$^{2}$FT: Parameter-Efficient Fine-Tuning in Sparse Spectrum Domain
Fourier Angle Alignment for Oriented Object Detection in Remote Sensing
InfiniDepth: Arbitrary-Resolution and Fine-Grained Depth Estimation with Neural Implicit Fields
Enhancing Spatial Understanding in Image Generation via Reward Modeling
TRM-VLA: Temporal-Aware Chain-of-Thought Reasoning and Memorization for Vision-Language-Action Models
Venus: Benchmarking and Empowering Multimodal Large Language Models for Aesthetic Guidance and Cropping
CAPT: Confusion-Aware Prompt Tuning for Reducing Vision-Language Misalignment
Olbedo: An Albedo and Shading Aerial Dataset for Large-Scale Outdoor Environments
Pluggable Pruning with Contiguous Layer Distillation for Diffusion Transformers
Scal3R: Scalable Test-Time Training for Feed-forward Large-Scale 3D Reconstruction
RAID: Retrieval-Augmented Anomaly Detection
TAlignDiff: Automatic Tooth Alignment assisted by Diffusion-based Transformation Learning
Learning to Track Instance from Single Natural Language Description
Uncertainty-driven 3D Gaussian Splatting Active Mapping via Anisotropic Visibility Field
Decoupled Residual Denoising Diffusion Models for Unified and Data Efficient Image-to-Image Translation
AdaptVision: Efficient Vision-Language Models via Adaptive Visual Acquisition
FEAT: Fashion Editing and Try-On from Any Design
Beyond Endpoints: Path-Centric Reasoning for Vectorized Off-Road Network Extraction
Transform to Transfer: Boosting Adversarial Attack Transferability on Vision-Language Pre-training Models
Concept-Aware Batch Sampling Improves Language-Image Pretraining
Confidence-Guided Multi-Scale Aggregation for Sparse-View High-Resolution 3D Gaussian Splatting
Drive My Way: Preference Alignment of Vision-Language-Action Model for Personalized Driving
Dual-Agent Reinforcement Learning for Adaptive and Cost-Aware Visual–Inertial Odometry
When Do Models Actually Decide? Mapping the Layer-Wise Decision Timeline in Pretrained Neural Networks
Predicting Spatial Transcriptomics from Histology Images via High-Order Multi-Cell Interaction Modeling
FlexAvatar: Learning Complete 3D Head Avatars with Partial Supervision
Cross-Axis Feature Fusion with Joint-Wise Motion Difference Prediction for Text-Based 3D Human Motion Editing
Personalized Image Descriptions from Attention Sequences
OSA: Echocardiography Video Segmentation via Orthogonalized State Update and Anatomical Prior-aware Feature Enhancement
MAPS: Preserving Vision-Language Representations via Module-Wise Proximity Scheduling for Better Vision-Language-Action Generalization
Dynamic Exposure Burst Image Restoration
Counterfactual VLA: Self-Reflective Vision-Language-Action Model with Adaptive Reasoning
Self-Consistency for LLM-based Motion Trajectory Generation and Verification
HanDyVQA: A Video QA Benchmark for Fine-Grained Hand-Object Interaction Dynamics
Time-Aware One Step Diffusion Network for Real-World Image Super-Resolution
SD-FSMIS: Adapting Stable Diffusion for Few-Shot Medical Image Segmentation
VisualOverload: Probing Visual Understanding of VLMs in Really Dense Scenes
LaRP: Efficient Multi-View Inpainting with Latent Reprojection Priors
Federated Active Learning Under Extreme Non-IID and Global Class Imbalance
Wave-Former: Through-Occlusion 3D Reconstruction via Wireless Shape Completion
ORPO: Orthogonal Panel-Relative Operators for Panel-Aware In-Context Image Generation
MLLM-HWSI: A Multimodal Large Language Model for Hierarchical Whole Slide Image Understanding
Enhancing Out-of-Distribution Detection with Extended Logit Normalization
Layered 4D-Rotor Gaussian Splatting: A Compressed Representation for Long Dynamic Scenes
PropFly: Learning to Propagate via On-the-Fly Supervision from Pre-trained Video Diffusion Models
RARE: Learn to RAnk and REtrieve for Monocular 3D Object Detection
Co-Me: Confidence Guided Token Merging for Visual Geometric Transformers
GaussianGrow: Geometry-aware Gaussian Growing from 3D Point Clouds with Text Guidance
DocPrune: Efficient Document Question Answering via Background, Question, and Comprehension-aware Token Pruning
Joint Learning of General and Diverse Patterns with Mixture of Memory Experts for Weakly-Supervised Video Anomaly Detection
HOG-Layout: Hierarchical 3D Scene Generation, Optimization and Editing via Vision-Language Models
Protect to Adapt: Subspace-Constrained Adaptation with Ranked Negative Prompt Feedback for Few-Shot Action Recognition
Hilbert Curve-Based Attention Enabling Topology-Preserving Image Tensor Representation for Semantic Segmentation Network
Debiased Sample Selection for Learning with Noisy Labels
Modeling Spatiotemporal Neural Frames for High Resolution Brain Dynamic
SPDMark: Selective Parameter Displacement for Robust Video Watermarking
DNF-SR: Dual-Input and Negative-Aware Feature Fine-Tuning for Real-World Image Super-Resolution
Any4D: Unified Feed-Forward Metric 4D Reconstruction
OSMO: Open-vocabulary Self-eMOtion Tracking
Task-Aware Image Signal Processor for Advanced Visual Perception
DUO-VSR: Dual-Stream Distillation for One-Step Video Super-Resolution
Δynamics: Language-Based Representation for Inferring Rigid-Body Dynamics From Videos
GenColorBench: A Color Evaluation Benchmark for Text-to-Image Generation
RoadGIE: Towards A Global-Scale Aerial Benchmark for Generalizable Interactive Road Extraction
Language-guided Frequency Modulation for Large Vision-Language Models
The Geometry of Robustness: Optimizing Loss Landscape Curvature and Feature Manifold Alignment for Robust Finetuning of Vision-Language Models
Retrieve and Segment: Are a Few Examples Enough to Bridge the Supervision Gap in Open-Vocabulary Segmentation?
FRAMER: Frequency-Aligned Self-Distillation with Adaptive Modulation Leveraging Diffusion Priors for Real-World Image Super-Resolution
OmniZip: Audio-Guided Dynamic Token Compression for Fast Omnimodal Large Language Models
CARD: Correlation Aware Restoration with Diffusion
AlcheMinT: Fine-grained Temporal Control for Multi-Reference Consistent Video Generation
Frequency Switching Mechanism for Parameter-Efficient Multi-Task Learning
Mining Instance-Centric Vision–Language Contexts for Human–Object Interaction Detection
COPYLENS: Towards Copyrighted Characters Infringement Detection via Copyright-Aware Prompt Learning
Reclaiming Lost Text Layers for Source-Free Cross-Domain Few-Shot Learning
Reevaluating the Intra-modal Misalignment Hypothesis in CLIP
PointGS: Semantic-Consistent Unsupervised 3D Point Cloud Segmentation with 3D Gaussian Splatting
3D Gaussian Splatting from unposed Spike Stream
MedCLIPSeg: Probabilistic Vision-Language Adaptation for Data-Efficient and Generalizable Medical Image Segmentation
GuideFlow: Constraint-Guided Flow Matching for Planning in End-to-End Autonomous Driving
Alert-CLIP: Abnormality-aware Latent-Enhanced Representation Tuning of CLIP for Video Anomaly Detection
An Efficient Token Compression Framework for Visual Object Tracking
Resolving Evidence Sparsity: Agentic Context Engineering for Long-Document Understanding
End-to-End Hyper-Relational Information Extraction for Engineering Diagrams via Dynamically Tokenized Relation Transformer
NS-Diff: Fluid Navier–Stokes Guided Video Diffusion via Reinforcement Learning
Online Data Curation for Object Detection via Marginal Contributions to Dataset-level Average Precision
CTCal: Rethinking Text-to-Image Diffusion Models via Cross-Timestep Self-Calibration
From Pixel to Precision: Enhancing Handwritten Mathematical Expression Recognition with Image-Level Reward
Synthetic Object Compositions for Scalable and Accurate Learning in Detection, Segmentation, and Grounding
VOLD: Reasoning Transfer from LLMs to Vision-Language Models via On-Policy Distillation
WildPose: A Unified Framework for Robust Pose Estimation in the Wild
Rank-Guided Pseudo-Bias Learning for Robust Black-Box Adaptation
ForeHOI: Feed-forward 3D Object Reconstruction from Daily Hand-Object Interaction Videos
Beyond Strict Pairing: Arbitrarily Paired Training for High-Performance Infrared and Visible Image Fusion
ProM3E: Probabilistic Masked MultiModal Embedding Model for Ecology
Guiding Diffusion Models with Semantically Degraded Conditions
DuoMo: Dual Motion Diffusion for World-Space Human Reconstruction
Deep Feature Deformation Weights
Neural Field-Based 3D Surface Reconstruction of Microstructures from Multi-Detector Signals in Scanning Electron Microscopy
Inferring Compositional 4D Scenes without Ever Seeing One
Open the Motion Door: Atomic Motion Decomposition and Recomposition for Open-Vocabulary Motion Generation
Bridging RGB and Hematoxylin Components: An Interleaved Guidance and Fusion Framework for Point Supervised Nuclei Segmentation
$A^2$GC: $A$symmetric $A$ggregation with Geometric Constraints for Locally Aggregated Descriptors
Learning to Reason in 4D: Dynamic Spatial Understanding for Vision Language Models
AREA3D: Active Reconstruction Agent with Unified Feed-Forward 3D Perception and Vision-Language Guidance
WeaveTime: Streaming from Earlier Frames into Emergent Memory in VideoLLMs
Learning complete and explainable visual representations from itemized text supervision
LiveGesture: Streamable Co-Speech Gesture Generation Model
Compositional Text-to-Image Generation Via Region-aware Bimodal Direct Preference Optimization
It Takes Two: A Duet of Periodicity and Directionality for Burst Flicker Removal
Foundry: Distilling 3D Foundation Models for the Edge
Revisiting Learning with Noisy Labels: Active Forgetting and Noise Suppression
InstructMix2Mix: Consistent Sparse-View Editing Through Multi-View Model Personalization
ReGenHOI: Unifying Reconstruction and Generation for 3D Human–Object Interaction Understanding
Seeing Conversations: Communication Context Identification in Egocentric Video
Basis-Oriented Low-rank Transfer for Few-Shot and Test-Time Adaptation
$\phi$-DPO: Fairness Direct Preference Optimization Approach to Continual Learning in Large Multimodal Models
UniSER: A Foundation Model for Unified Soft Effects Removal
Condensed Test-Time Adaptation of VLMs for Action Recognition
CLAY: Conditional Visual Similarity Modulation in Vision-Language Embedding Space
Temporal Imbalance of Positive and Negative Supervision in Class-Incremental Learning
2D-LFM: Lifting Foundation Model without 3D supervision
MatE: Material Extraction from Single-Image via Geometric Prior
Real-Time Neural Video Compression with Unified Intra and Inter Coding
Mirror Illusion Art
UniEdit-I: Training-free Image Editing for Unified VLM via Iterative Understanding, Editing and Verifying
InterPrior: A Scalable Motion Prior for Physics-Based Human-Object Interactions
Mapping Networks
PolarGuide-GSDR: 3D Gaussian Splatting Driven by Polarization Priors and Deferred Reflection for Real-World Reflective Scenes
Narrative Weaver: Towards Controllable Long-Range Visual Consistency with Multi-Modal Conditioning
Towards Highly-Constrained Human Motion Generation with Retrieval-Guided Diffusion Noise Optimization
DiverseDiT: Towards Diverse Representation Learning in Diffusion Transformers
$L^{2}DGS$: Low-Light Dynamic Gaussian Splatting
GLINT: Modeling Scene-Scale Transparency via Gaussian Radiance Transport
Exemplar-Free Continual Learning for State Space Models
Grid Distillation: Compositional Image Distillation via Structured Generative Grids
Towards Training-free Scene Text Editing
MoEActok: A MoE-based Action Tokenizer for Vision-Language-Action Models
SketchDeco: Training-Free Latent Composition for Precise Sketch Colourisation
MusicInfuser: Making Video Diffusion Listen and Dance
INSID3: Training-Free In-Context Segmentation with DINOv3
SparseSplat: Towards Applicable Feed-Forward 3D Gaussian Splatting with Pixel-Unaligned Prediction
Improving Controllable Generation: Faster Training and Better Performance via x0-Supervision
QuadSync: Quadrifocal Tensor Synchronization via Tucker Decomposition
ExpPortrait: Expressive Portrait Generation via Personalized Representation
Delta Rectified Flow Sampling for Text-to-Image Editing
SketchFaceGS: Real-Time Sketch-Driven Face Editing and Generation with Gaussian Splatting
Sketch2CT: Multimodal Diffusion for Structure-Aware 3D Medical Volume Generation
AnyDoc: Enhancing Document Generation via Large-Scale HTML/CSS Data Synthesis and Height-Aware Reinforcement Optimization
MeteorPred: A Meteorological Multimodal Large Model and Dataset for Severe Weather Event Prediction
Agile Deliberation: Concept Deliberation for Subjective Visual Classification
Flowception: Temporally Expansive Flow Matching for Video Generation
FreeForm: Reduced-Order Deformable Simulation from Particle-Based Skinning Eigenmodes
Exact-GS: Mathematically Rigorous and Accurate 3D Gaussian Splatting for 3D X-ray Reconstruction
Hoi! - A Multimodal Dataset for Force-Grounded, Cross-View Articulated Manipulation
Computational Speckle Pattern Interferometry
CountGD++: Generalized Prompting for Open-World Counting
Black-Box Domain Adaptation for Object Detection with Retention-Driven Knowledge Compression
TDATR: Improving End-to-End Table Recognition via Table Detail-Aware Learning and Cell-Level Visual Alignment
SoC: Semantic Orthogonal Calibration for Test-Time Prompt Tuning
GIFSplat: Generative Prior-Guided Iterative Feed-Forward 3D Gaussian Splatting from Sparse Views
Reinforcing Structured Chain-of-Thought for Video Understanding
FastRef: Fast Prototype Refinement for Few-shot Industrial Anomaly Detection
Defending Unauthorized Model Merging via Dual-Stage Weight Protection
SpatialDiff: 3D-Aware Object Movement via Implicit Spatial Modeling
TrajTok: Learning Trajectory Tokens enables better Video Understanding
Do You See What I Am Pointing At? Gesture-Based Egocentric Video Question Answering
TopoMesh: High-Fidelity Mesh Autoencoding via Topological Unification
DMGD: Train-Free Dataset Distillation with Semantic-Distribution Matching in Diffusion Models
Vision-Speech Models: Teaching Speech Models to Converse about Images
CRAFT-LoRA: Content-Style Personalization via Rank-Constrained Adaptation and Training-Free Fusion
UniMMAD: Unified Multi-Modal and Multi-Class Anomaly Detection via MoE-Driven Feature Decompression
SunFaded: Illumination-Aware Gaussian Splatting for Dark Scenes with Camera-Mounted Active Lighting
SpiderCam: Low-Power Snapshot Depth from Differential Defocus
Ultra-Fast Neural Video Compression
Learnable Motion-Focused Tokenization for Effective and Efficient Video Unsupervised Domain Adaptation
Cycle-Consistent Tuning for Layered Image Decomposition
Free-Grained Hierarchical Visual Recognition
FlashCap: Millisecond-Accurate Human Motion Capture via Flashing LEDs and Event-Based Vision
SceneTok: A Compressed, Diffusable Token Space for 3D Scenes
OmniFood8K: Single-Image Nutrition Estimation via Hierarchical Frequency-Aligned Fusion
Hear you are: Teaching LLMs Spatial Reasoning with Vision and Spatial Sound
CoRiM: Conflict-driven Risk Minimization for Dynamic Multimodal Fusion
NeuROK: Generative 4D Neural Object Kinematics
SegMo: Co-Designing Content-Aware Sparsity and Locally-Cohesive Segment Parallelism for Efficient VLM Inference
DIMOS: Disentangling Instance-level Moving Object Segmentation
Hidden Monotonicity: Explaining Deep Neural Networks via their DC Decomposition
Beyond Single Images: A Comprehensive Benchmark for Album-Level Vision-Language Understanding
UniGen-1.5: Enhancing Image Generation and Editing through Reward Unification in RL
SSM-Aware Token-Efficient VMamba via Adaptive Patch Pruning and Merging for Person Re-Identification
AssemblyBench: Physics-Aware Assembly of Complex Industrial Objects
Cross-modal Fuzzy Alignment Network for Text-Aerial Person Retrieval and A Large-scale Benchmark
Concept-Guided Fine-tuning: Steering ViTs away from Spurious Correlations to Improve Robustness
S$^2$AM3D: Scale-controllable Part Segmentation of 3D Point Clouds
Tracking by Predicting 3-D Gaussians Over Time
Match-and-Fuse: Consistent Generation from Unstructured Image Sets
EVATok: Adaptive Length Video Tokenization for Efficient Visual Autoregressive Generation
LoD-Loc v3: Generalized Aerial Localization in Dense Cities using Instance Silhouette Alignment
PhotoFramer: Multi-modal Image Composition Instruction
Trainable Log-linear Sparse Attention for Efficient Diffusion Transformers
The Surprising Effectiveness of Noise Pretraining for Implicit Neural Representations
MVGGT: Multimodal Visual Geometry Grounded Transformer for Multiview 3D Referring Expression Segmentation
NoRD: A Data-Efficient Vision-Language-Action Model that Drives without Reasoning
High-Fidelity Mobile Avatars with Pruned Local Blendshapes
Event-Illumination Collaborative Low-light Image Enhancement with a High-resolution Real-world Dataset
Attention Surgery: An Efficient Recipe to Linearize Your Video Diffusion Transformer
Vinedresser3D: Towards Agentic Text-guided 3D Editing
AURA: Multi-modal Shared Autonomy for Urban Navigation
ARM-Thinker: Reinforcing Multimodal Generative Reward Models with Agentic Tool Use and Visual Reasoning
BAgger: Backwards Aggregation for Mitigating Drift in Autoregressive Video Diffusion Models
Diffusion-Based Makeup Transfer with Facial Region-Aware Makeup Features
BiFM: Bidirectional Flow Matching for Few-Step Image Editing and Generation
Learning by Analogy: A Causal Framework for Compositional Generalization
Diff4Splat: Repurposing Video Diffusion Models for Dynamic Scene Generation
PoseGAM: Robust Unseen Object Pose Estimation via Geometry-Aware Multi-View Reasoning
MAPo: Motion-Aware Partitioning of Deformable 3D Gaussian Splatting for High-Fidelity Dynamic Scene Reconstruction
Clair Obscur: an Illumination-Aware Method for Real-World Image Vectorization
Detecting Compressed AI-Generated Images via Phase Spectrum Robustness
UniPercept: A Unified Diffusion Model for Generalizable Visual Perception
Cross-modal Identity Mapping: Minimizing Information Loss in Modality Conversion via Reinforcement Learning
SearchAD: Large-Scale Rare Image Retrieval Dataset for Autonomous Driving
HiCoGen: Hierarchical Compositional Text-to-Image Generation in Diffusion Models via Reinforcement Learning
Efficient unrolled networks for large-scale 3D inverse problems
MoBind: Motion Binding for Fine-Grained IMU–Video Pose Alignment
OMG-Avatar: One-shot Multi-LOD Gaussian Head Avatar
How Much 3D Do Video Foundation Models Encode?
Transition Matching Distillation for Fast Video Generation
Multi-View Hierarchical Alignment Learning for Spatial Transcriptomics
Tracking-Guided 4D Generation: Foundation-Tracker Motion Priors for 3D Model Animation
Task-Oriented Data Synthesis and Control-Rectify Sampling for Remote Sensing Semantic Segmentation
TIGER: A Unified Framework for Time, Images and Geo-location Retrieval
Photo-Guided Tooth Segmentation on 3D Oral Scan Model
VGG-T$^3$: Offline Feed-Forward 3D Reconstruction at Scale
LASER: Layer-wise Scale Alignment for Training-Free Streaming 4D Reconstruction
DDiT: Dynamic Patch Scheduling for Efficient Diffusion Transformers
Fuel Gauge: Estimating Chain-of-Thought Length Ahead of Time in Large Multimodal Models
MoRE: 3D Visual Geometry Reconstruction Meets Mixture-of-Experts
Pip-Stereo: Progressive Iterations Pruner for Iterative Optimization based Stereo Matching
Geometry-Aligned and Anomaly-Aware Reconstruction for 3D Anomaly Detection
Mitigating Multimodal Hallucinations via Gradient-based Self-Reflection
STAR-R1: Multi-View Spatial TrAnsformation Reasoning by Reinforcing Multimodal LLMs
Back to Basics: Let Denoising Generative Models Denoise
Mesh4D: 4D Mesh Reconstruction and Tracking from Monocular Video
AsymLoc: Towards Asymmetric Feature Matching for Efficient Visual Localization
Heuristic-inspired Reasoning Priors Facilitate Data-Efficient Referring Object Detection
Iris: Bringing Real-World Priors into Diffusion Model for Monocular Depth Estimation
Few-Shot Hybrid Incremental Learning: Continually Learning under Data Scarcity and Task Uncertainty
MultiCrafter: High-Fidelity Multi-Subject Generation via Disentangled Attention and Identity-Aware Preference Alignment
GeoSURGE: Geo-localization using Semantic Fusion with Hierarchy of Geographic Embeddings
Enhancing Mixture-of-Experts Specialization via Cluster-Aware Upcycling
Agentic Learner with Grow-and-Refine Multimodal Semantic Memory
Evolving Contextual Safety in Multi-Modal Large Language Models via Inference-Time Self-Reflective Memory
TokenSplat: Token-aligned 3D Gaussian Splatting for Feed-forward Pose-free Reconstruction
EcoSplat: Efficiency-controllable Feed-forward 3D Gaussian Splatting from Multi-view Images
Simple but Effective Triplet-Based Compression Strategies for Compact Visual Localization
Diffusion Mental Averages
Rethinking 2D-3D Registration: A Novel Network for High-Value Zone Selection and Representation Consistency Alignment
Video-CoE: Reinforcing Video Event Prediction via Chain of Events
Stable Mean Flow: Lyapunov-Inspired One-Step Flow Matching
PPISP: Physically-Plausible Compensation and Control of Photometric Variations in Radiance Field Reconstruction
LightSplat: Fast and Memory-Efficient Open-Vocabulary 3D Scene Understanding in Five Seconds
Beyond Heuristic Prompting: A Concept-Guided Bayesian Framework for Zero-Shot Image Recognition
Restore Text First, Enhance Image Later: Two-Stage Scene Text Image Super-Resolution with Glyph Structure Guidance
Perceptual Neural Video Compression with Color Separation and Rank Chain
LRHDR: Learning Representation-enhanced HDR Video Reconstruction
MatchED: Crisp Edge Detection Using End-to-End, Matching-based Supervision
UniDAC: Universal Metric Depth Estimation for Any Camera
Diffusion Probe: Generated Image Result Prediction Using CNN Probes
Seeing without Pixels: Perception from Camera Trajectories
CaliTex: Geometry-Calibrated Attention for View-Coherent 3D Texture Generation
Enhancing Continual Learning of Vision-Language Models via Dynamic Prefix Weighting
Let VLMs Grade Their Own Thoughts: A Self-Quantification Approach to Reasoning-Aware Reward Modeling
Parameter-Efficient Semantic Augmentation for Enhancing Open-Vocabulary Object Detection
DiffuView: Multi-View Diffusion Pretraining for 3D Aware Robotic Manipulation
Learning Eigenstructures of Unstructured Data Manifolds
An Empirical Study on How Video-LLMs Answer Video Questions
EMR-Diff: Edge-aware Multimodal Residual Diffusion Model for Hyperspectral Image Super-resolution
Geometrically-Constrained Agent for Spatial Reasoning
From Feature Learning to Spectral Basis Learning: A Unifying and Flexible Framework for Efficient and Robust Shape Matching
Bridging Privacy and Provenance: Traceable Virtual Identity Generation
Catalyst4D: High-Fidelity 3D-to-4D Scene Editing via Dynamic Propagation
Conversational Image Segmentation: Grounding Abstract Concepts with Scalable Supervision
Too Vivid to Be Real? Benchmarking and Calibrating Generative Color Fidelity
FedRG: Unleashing the Representation Geometry for Federated Learning with Noisy Clients
PHAC: Promptable Human Amodal Completion
A Training-Free Style-Personalization via SVD-Based Feature Decomposition
WorldReel: 4D Video Generation with Consistent Geometry and Motion Modeling
Retrieve-to-Restore: Efficient All-in-One Image Restoration with a Retrieval-Based Degradation Bank
A Mixed Diet Makes DINO an Omnivorous Vision Encoder
TopoCL: Topological Contrastive Learning for Medical Imaging
Mitigating Objectness Bias and Region-to-Text Misalignment for Open-Vocabulary Panoptic Segmentation
Memory-Efficient Fine-Tuning Diffusion Transformer via Dynamic Patch Sampling and Block Skipping
HUMAPS-4D: A Multimodal Dataset for HUman Motion Analysis with Physiological and Semantic informations
DeX-Portrait: Disentangled and Expressive Portrait Animation via Explicit and Latent Motion Representations
MambaLiteUNet: Cross-Gated Adaptive Feature Fusion for Robust Skin Lesion Segmentation
Cross-Instance Gaussian Splatting Registration via Geometry-Aware Feature-Guided Alignment
WhisperNet: A Scalable Solution for Bandwidth-Efficient Collaboration
When Local Rules Create Global Order: Self-Organized Representation Learning for Latent Diffusion Models
Towards Intrinsic-Aware Monocular 3D Object Detection
E$^2$-SCI: Elastic Edge–Cloud Speculative Decoding via Credit Inertia
Causality in Video Diffusers is Separable from Denoising
MoReGen: Multi-Agent Motion-Reasoning Engine for Code-based Text-to-Video Synthesis
FlowFM: Advancing Dark Optical Flow Estimation with Flow Matching
X-WIN: Building Chest Radiograph World Model via Predictive Sensing
Unified Number-Free Text-to-Motion Generation Via Flow Matching
Towards Hierarchical 3D Spatial Understanding in Vision-Language Models
DCoAR: Deep Concept Injection into Unified Autoregressive Models for Personalized Text-to-Image Generation
PointThinker: Point-Incentivized Parallel Thinking for Multimodal Large Language Model
SIMSPINE: A Biomechanics-Aware Simulation Framework for 3D Spine Motion Annotation and Benchmarking
C-LaV: Conditional Latent Velocity Field Denoising for Weather-Robust LiDAR Place Recognition
Linear Image Generation by Synthesizing Exposure Brackets
UniT: Unified Multimodal Chain-of-Thought Test-time Scaling
Cinematic Audio Source Separation Using Visual Cues
Energy-GS: Image Energy-guided Pose Alignment Gaussian Splatting with redesigned pose gradient flow
Geometric-Photometric Event-based 3D Gaussian Ray Tracing
CAST: Context-Aware Dynamic Latent Space Transformation for Interactive Text-to-Image Retrieval
A Self-Conditioned Representation Guided Diffusion Model for Realistic Text-to-LiDAR Scene Generation
Accelerating Streaming Video Understanding via Hierarchical Token Compression
MonoSAOD: Monocular 3D Object Detection with Sparsely Annotated Label
Seeing as Experts Do: A Knowledge-Augmented Agent for Open-Set Fine-Grained Visual Understanding
DFD-HR: Generalizable Deepfake Detection via Hierarchical Routing Learning
PackUV: Packed Gaussian UV Maps for 4D Volumetric Video
SGAD-SLAM: Splatting Gaussians at Adjusted Depth for Better Radiance Fields in RGBD SLAM
Free-Lunch Long Video Generation via Layer-Adaptive O.O.D Correction
Camera Control for Text-to-Image Generation via Learning Viewpoint Tokens
MatAnyone 2: Scaling Video Matting via a Learned Quality Evaluator
Causal Motion Diffusion Models for Autoregressive Motion Generation
Unified Multimodal Models as Auto-Encoders
CD-Buffer: Complementary Dual-Buffer Framework for Test-Time Adaptation in Adverse Weather Object Detection
Joint Spectral Image Reconstruction and Semantic Segmentation with Cooperative Unfolding
DeepScan: A Training-Free Framework for Visually Grounded Reasoning in Large Vision-Language Models
It's Never Too Late: Noise Optimization for Collapse Recovery in Trained Diffusion Models
MR-RAG: Multimodal Relevance-Aware Retrieval-Augmented Generation for Medical Visual Question Answering
HamiPose: Hamiltonian Optimization for Unsupervised Domain Adaptive Pose Estimation
Score2Instruct: Scaling Up Video Quality-Centric Instructions via Automated Dimension Scoring
The Missing Point in Vision Transformers for Universal Image Segmentation
Face2Scene: Using Facial Degradation as an Oracle for Diffusion-Based Scene Restoration
Towards Robust Multimodal Large Language Models Against Jailbreak Attacks
JoPPO: Hierarchical Photography Assessment via Contrastive Joint Conditional Probabilistic Reinforcement Learning
Phased DMD: Few-step Distribution Matching Distillation via Score Matching within Subintervals
Dual-Level Confidence based Implicit Self-Refinement for Medical Visual Question Answering
Revisiting Pose Sensitivity in Splat-based Computed Tomography under Sparse-view Reconstruction
SAG-GNN: Semantic-Aware Guided GNN for Descriptor-Free 2D-3D Matching
Towards Robust Vision Transformers: Path Dependency Analysis and a Simple Two-Stage Adversarial Training
X-AVDT: Audio-Visual Cross-Attention for Robust Deepfake Detection
Direction-aware 3D Large Multimodal Models
RealBirdID: Benchmarking Bird Species Identification in the Era of MLLMs
FB-CLIP: Fine-Grained Zero-Shot Anomaly Detection with Foreground-Background Disentanglement
FISHuman: Fine-grained Single-image 3D Human Reconstruction via Multi-view 4D Remeshing
SwiftTailor: Efficient 3D Garment Generation with Geometry Image Representation
Elastic Weight Consolidation Done Right for Continual Learning
OpenVO: Open-World Visual Odometry with Temporal Dynamics Awareness
Exploring Conditions for Diffusion models in Robotic Control
SkillSight: Efficient First-Person Skill Assessment with Gaze
GeoRK2: Geometry-Guided Runge–Kutta Integration for Diffusion Transformer Acceleration
Uni-Encoder Meets Multi-Encoders: Representation Before Fusion for Brain Tumor Segmentation with Missing Modalities
Statistical Characteristic-Guided Denoising for Rapid High-Resolution Transmission Electron Microscopy Imaging
SplatSuRe: Selective Super-Resolution for Multi-view Consistent 3D Gaussian Splatting
Concept Regions Matter: Benchmarking CLIP with a New Cluster-Importance Approach
Diverse Video Generation with Determinantal Point Process-Guided Policy Optimization
TerraScope: Pixel-Grounded Visual Reasoning for Earth Observation
SymphoMotion: Joint Control of Camera Motion and Object Dynamics for Coherent Video Generation
Yume1.5: A Text-Controlled Interactive World Generation Model
PanoVGGT: Feed-Forward 3D Reconstruction from Panoramic Imagery
EffectErase: Joint Video Object Removal and Insertion for High-Quality Effect Erasing
From Sketch to Fresco: Efficient Diffusion Transformer with Progressive Resolution
MV-RoMa: From Pairwise Matching into Multi-View Track Reconstruction
SPAR: Single-Pass Any-Resolution ViT for Open-vocabulary Segmentation
UnicEdit-10M: A Dataset and Benchmark Breaking the Scale-Quality Barrier via Unified Verification for Reasoning-Enriched Edits
Not All Birds Look The Same: Identity-Preserving Generation For Birds
One Patch to Caption Them All: A Unified Zero-Shot Captioning Framework
Scaling Self-Supervised and Cross-Modal Pretraining for Volumetric CT Transformers
ProxyFL: A Proxy-Guided Framework for Federated Semi-Supervised Learning
Anatomical Domain Shifts: Test-time Heterogeneous Adaptation for 3D Human Pose Prediction
OntoAug: Rethinking Generative Data Augmentation via Ontology Guidance
SharpTimeGS: Sharp and Stable Dynamic Gaussian Splatting via Lifespan Modulation
CapNav: Benchmarking Vision Language Models on Capability-conditioned Indoor Navigation
Sketch2Colab: Sketch-Conditioned Multi-Human Animation via Controllable Flow Distillation
Real-Time Generation of Streamable Talking Portrait Video with Reference-Guided Deep Compression VAEs
Towards Decompositional Human Motion Generation with Energy-Based Diffusion Models
Text-Image Conditioned 3D Generation
DA-Mamba: Learning Domain-Aware State Space Model for Global-Local Alignment in Domain Adaptive Object Detection
Erasing Thousands of Concepts: Towards Scalable and Practical Concept Erasure for Text-to-Image Diffusion Models
PDD: Manifold-Prior Diverse Distillation for Medical Anomaly Detection
Hugging Visual Prompt and Segmentation Tokens: Consistency Learning for Fine-Grained Visual Understanding in MLLMs
TWINGS: Thin Plate Splines Warp-aligned Initialization for Sparse-View Gaussian Splatting
ZINA: Multimodal Fine-grained Hallucination Detection and Editing
Collaborative Multi-Mode Pruning for Vision-Language Models
Lifting Unlabeled Internet-scale Data for 3D Scene Understanding
EasyV2V: A High-quality Instruction-based Video Editing Framework
AMusE: Audio-Visual Benchmark and Alignment Framework for Agentic Multi-Speaker Understanding
4DSurf: High-Fidelity Dynamic Scene Surface Reconstruction
MERG3R: A Divide-and-Conquer Approach to Large-Scale Neural Visual Geometry
C-GenReg: Training-Free 3D Point Cloud Registration by Multi-View-Consistent Geometry-to-Image Generation with Probabilistic Modalities Fusion
G-MIXER: Geodesic Mixup-based Implicit Semantic Expansion and Explicit Semantic Re-ranking for Zero-Shot Composed Image Retrieval
WarpTracker: Tracking by Warping instead of Correlation
MixFlow Training: Alleviating Exposure Bias with Slowed Interpolation Mixture
Language-Free Generative Editing from One Visual Example
Physically Inspired Gaussian Splatting for HDR Novel View Synthesis
Progressive mask distillation for self-supervised video representation
Semantic Context Matters: Improving Conditioning for Autoregressive Models
LagerNVS: Latent Geometry for Fully Neural Real-time Novel View Synthesis
AceTone: Bridging Words and Colors for Conditional Image Grading
Ultra-Low Bitrate Perceptual Image Compression with Shallow Encoder
MotionCrafter: Repurposing Video Generators for Dense Geometry and Motion Reconstruction
Where MLLMs Attend and What They Rely On: Explaining Autoregressive Token Generation
Controllable Federated Prompt Learning at Test Time
Query2Uncertainty: Robust Uncertainty Quantification and Calibration for 3D Object Detection under Distribution Shift
See What I Mean: Aligning Vision and Language Representations for Video Fine-grained Object Understanding
NeuroSeg Meets DINOv3: Transferring 2D Self-Supervised Visual Priors to 3D Neuron Segmentation via DINOv3 Initialization
Delving Aleatoric Uncertainty in Medical Image Segmentation via Vision Foundation Models
Semantic Derivative Flow: Graph-Guided Diffusion for Controllable Instance Interactions
HDR-VLM: HDR-Domain Adaptation of VLMs and Preference-Aligned Quality Assessment for HDR Video Color Grading
Decouple Your Discovery and Memory in Continual Generalized Category Discovery
Elucidating the Design Space of Arbitrary-Noise-Based Diffusion Models
Minimal Constraint Relaxation for Multiview Autocalibration
AG-VAS: Anchor-Guided Zero-Shot Visual Anomaly Segmentation with Large Multimodal Models
Point4Cast: Streaming Dynamic Scene Reconstruction and Forecasting
Masking Teacher and Reinforcing Student for Distilling Vision-Language Models
AToken: A Unified Tokenizer for Vision
IDESplat: Iterative Depth Probability Estimation for Generalizable 3D Gaussian Splatting
EcoAlign: An Economically Rational Framework for Efficient LVLM Alignment
Fed-ADE: Adaptive Learning Rate for Federated Post-adaptation under Distribution Shift
UniGame: Turning a Unified Multimodal Model Into Its Own Adversary
Inside-Out: Measuring Generalization in Vision Transformers Through Inner Workings
Interpretable and Steerable Concept Bottleneck Sparse Autoencoders
SpaceTimePilot: Generative Rendering of Dynamic Scenes Across Space and Time
Omni-Attribute: Open-vocabulary Image Attribute Encoder for Visual Disentanglement and Composition
Intrinsic Image Fusion for Multi-View 3D Material Reconstruction
Relightful Video Portrait Harmonization
Efficient Real-Time Raw-to-Raw Denoising for Extreme Low-Light Ultra HD Video on Mobile Devices
Describe Anything Anywhere At Any Moment
ReCoFuse: Ultra-Robust Image Fusion via Restorative Multi-Modal Diffusion Reciprocal Coupling
Ego-InBetween: Generating Object State Transitions in Ego-Centric Videos
Symphony: A Cognitively-Inspired Multi-Agent System for Long-Video Understanding
Cross-modal Representation Learning for Diffusion-generated Image Detection
HorizonForge: Driving Scene Editing with Any Trajectories and Any Vehicles
CA-LoRA: Concept-Aware LoRA for Domain-Aligned Segmentation Dataset Generation
KnowVal: A Knowledge-Augmented and Value-Guided Autonomous Driving System
StreamReady: Learning *What* to Answer and *When* in Long Streaming Videos
Training-free Motion Factorization for Compositional Video Generation
Rosetta Stone For Unified MLLMs: A unified tokenizer to decipher understanding and generation
Let it Snow! Animating 3D Gaussian Scenes with Dynamic Weather Effects via Physics-Guided Score Distillation
VS-Bench: Evaluating VLMs for Strategic Abilities in Multi-Agent Environments
Spatial-Frequency Collaborative Learning for Occluded Visible-Infrared Person Re-Identification
Investigating Self-Supervised Representations for Audio-Visual Deepfake Detection
R4: Retrieval-Augmented Reasoning for Vision-Language Models in 4D Spatio-Temporal Space
CoFiDA-M: Concept-Aware Feature Modulation for Cross-Domain Adaptation with Image-Only Inference
Consensus vs. Controversy: Mapping the Decision Space Where Architectures Diverge
Human Interaction-Aware 3D Reconstruction from a Single Image
Stitch-a-Demo: Creating Video Demonstrations from Multistep Descriptions
Smoothing the Score Function to Enhance Generalization in Diffusion Models
RenderFlow: Single-Step Neural Rendering via Flow Matching
GeoAgent: Learning to Geolocate Everywhere with Reinforced Geographic Characteristics
MovieRecapsQA: A Multimodal Open-Ended Video Question-Answering Benchmark
AnthroTAP: Learning Point Tracking with Real-World Motion
DetectSCI: Toward Object-Guided ROI Reconstruction for High-Resolution Video Snapshot Compressive Imaging
Geometric Neural Distance Fields for Learning Human Motion Priors
Edges Compete for Trust: Group Relative Edge Optimization for Building Reconstruction from Point Clouds
Shape-of-You: Fused Gromov-Wasserstein Optimal Transport for Semantic Correspondence in-the-Wild
Enhancing Descriptive Captions with Visual Attributes for Multimodal Perception
Adapting Point Cloud Analysis via Multimodal Bayesian Distribution Learning
Enhancing Hands in 3D Whole-Body Pose Estimation with Conditional Hands Modulator
GeoFree-CoSeg: Unsupervised Point Cloud-Image Cross-Modal Co-Segmentation Without Geometric Alignment
ID-Sim: An Identity-Focused Similarity Metric
SparseWorld-TC: Trajectory-Conditioned Sparse Occupancy World Model
PACT: Phase-Like Transition Constraints in Adapter-Based Continual Learning of Vision-Language Models
GaussianFluent: Gaussian Simulation for Dynamic Scenes with Mixed Materials
Scalable Multi-View Subspace Clustering with Tensorized Anchor Guidance
A Unified Benchmark for HOI Evaluation across Vision-Language Models and HOI-Specific Methods
Lipschitz Optimization for Formal Verification of Homographies
ViBES: A Conversational Agent with Behaviorally-Intelligent 3D Virtual Body
EpiAgent: An Agent-Centric System for Ancient Inscription Restoration
Cell-Type Prototype-Informed Neural Network for Gene Expression Estimation from Pathology Images
Lumosaic: Hyperspectral Video via Active Illumination and Coded-Exposure Pixels
Proxy-GS: Unified Occlusion Priors for Training and Inference in Structured 3D Gaussian Splatting
Vision-Language Model Guided Source-Free Domain Adaptation via Optimal Transport
DyaDiT: A Multi-Modal Diffusion Transformer for Socially-Aware Dyadic Gesture Generation
UniVerse: A Unified Modulation Framework for Segmentation-Free, Disentangled Multi-Concept Personalization
MeToM: Metadata-Guided Token Merging for Efficient Video LLMs
ActionMesh: Animated 3D Mesh Generation with Temporal 3D Diffusion
CineScene: Implicit 3D as Effective Scene Representation for Cinematic Video Generation
Human-Centric Multi-Exposure Fusion: Benchmark and Bi-level Cognition Distillation Framework
Generate, Analyze, and Refine: Training-Free Sound Source Localization via MLLM Meta-Reasoning
Evidential Transformation Network: Turning Pretrained Models into Evidential Models for Uncertainty Estimation
AE2VID: Event-based Video Reconstruction via Aperture Modulation
Leveraging Class Distributions in CLIP for Weakly Supervised Semantic Segmentation
Spatial-Aware VLA Pretraining through Visual-Physical Alignment from Human Videos
Layer-wise Instance Binding for Regional and Occlusion Control in Text-to-Image Diffusion Transformers
FloVerse: Floor Plan-Guided Multi-Modal Navigation
Sparse–View Localization via Online Neural 3D Regression
Soul: Breathe Life into Digital Human for High-fidelity Long-term Multimodal Animation
PvP: Data-Efficient Humanoid Robot Learning with Proprioceptive-Privileged Contrastive Representations
Dexterous World Models
Recurrent Video Masked Autoencoders
HFR and HDR Video from Multi-Attenuated Spikes Using a Rapidly Rotating SpokeND Filter
ShowUI-π: Flow-based Generative Models as GUI Dexterous Hands
AdapTok: Learning Adaptive and Temporally Causal Video Tokenization in a 1D Latent Space
PA-Attack: Guiding Gray-Box Attacks on LVLM Vision Encoders with Prototypes and Attention
SAR2Net: Learning Spatially Anchored Representations for Retrieval-Guided Cross-Stain Alignment
POLAR: A Portrait OLAT Dataset and Generative Framework for Illumination-Aware Face Modeling
Prospective Dynamic 3D MRI Reconstruction via Latent-Space Motion Tracking from Single Measurement
Domain-Skewed Federated Learning with Feature Decoupling and Calibration
Octopus: History-Free Gradient Orthogonalization for Continual Learning in Multimodal Large Language Models
Next-Scale Prediction: A Self-Supervised Approach for Real-World Image Denoising
BridgeEQA: Virtual Embodied Agents for Real Bridge Inspections
MRD: Multi-resolution Retrieval-Detection Fusion for High-Resolution Image Understanding
E2EGS: Event-to-Edge Gaussian Splatting for Pose-Free 3D Reconstruction
Training-free Detection of Generated Videos via Spatial-Temporal Likelihoods
Vocabulary Scaling Law: Tuning Open-vocabulary Predictors for Their Openness
ActiveAD: Planning-Oriented Active Learning for End-to-End Autonomous Driving
Missing No More: Dictionary-Guided Cross-Modal Image Fusion under Missing Infrared
SimLBR: Learning to Detect Fake Images by Learning to Detect Real Images
Perception Characteristics Distance: Measuring Stability and Robustness of Perception System in Dynamic Conditions under a Certain Decision Rule
Beyond Scanpaths: Graph-Based Gaze Simulation in Dynamic Scenes
TF-CADE: Foreground-Concentrated Text-Video Alignment for Zero-Shot Temporal Action Detection
Intervention-Aware Multiscale Representation Learning from Imaging Phenomics and Perturbation Transcriptomics
Seeing the Scene Matters: Revealing Forgetting in Video Understanding Models with a Scene-Aware Long-Video Benchmark
From Scale to Speed: Adaptive Test-Time Scaling for Image Editing
FoSS: Modeling Long-Range Dependencies and Multimodal Uncertainty in Trajectory Prediction via Fourier–State Space Integration
ARES: Unifying Asymmetric RGB-Event Stereo for Probabilistic Scene Flow Estimation
LESA: Learnable Stage-Aware Predictors for Diffusion Model Acceleration
Spectral Mixture-of-Experts for Continual Learning
GeoRelight: Learning Joint Geometrical Reconstruction and Relighting with Flexible Multi-Modal Diffusion Transformers
Disco-GS: Gaussian Splatting in Dynamic Color Lighting
Event Stream Filtering via Probability Flux Estimation
MSCD-GS: Motion-Separated Cooperative Deblurring Dynamic Reconstruction via Gaussian Splatting
ELiC: Efficient LiDAR Geometry Compression via Cross-Bit-depth Feature Propagation and Bag-of-Encoders
3D Gaussian Splatting with Self-Constrained Prior for High Fidelity Surface Reconstruction
SelecTKD: Selective Token-Weighted Knowledge Distillation for LLMs
Voxify3D: Pixel Art Meets Volumetric Rendering
PhyCritic: Multimodal Critic Models for Physical AI
FlexTraj: Image-to-Video Generation with Flexible Point Trajectory Control
CADFS: A Big CAD Program Dataset and Framework for Computer-Aided Design with Large Language Models
MOMO: Mars Orbital MOdel Foundation Model for Mars Orbital Applications
Learning Hierarchical Hyperbolic Mixture Model for Part-aware 3D Generation
MTA: Multimodal Task Alignment for BEV Perception and Captioning
POGA: Paraphrased and Oppositional Graph Alignment for Fine-Grained Cross-Modal Retrieval
Explaining Object Detectors via Collective Contribution of Pixels
ViHOI: Human-Object Interaction Synthesis with Visual Priors
ACPV-Net: All-Class Polygonal Vectorization for Seamless Vector Map Generation from Aerial Imagery
PAF: Perturbation-Aware Filtering for Open-Set Semi-Supervised Learning
T2SGrid: Temporal-to-Spatial Gridification for Video Temporal Grounding
SonoWorld: From One Image to a 3D Audio-Visual Scene
PosterReward: Unlocking Accurate Evaluation for High-Quality Graphic Design Generation
ST4R-Splat: Spatio-Temporal Referring Segmentation in 4D Gaussian Splatting
Unlocking the Power of Critical Factors for 3D Visual Geometry Estimation
SelfHVD: Self-Supervised Handheld Video Deblurring
Accelerating Diffusion via Hybrid Data-Pipeline Parallelism Based on Conditional Guidance Scheduling
FAAR: Efficient Frequency-Aware Multi-Task Fine-Tuning via Automatic Rank Selection
Neighbor GRPO: Contrastive ODE Policy Optimization Aligns Flow Models
Residual Primitive Fitting of 3D Shapes with SuperFrusta
Erasing Invisible Watermarks via Novel View Synthesis
White-Balance First, Adjust Later: Cross-Camera Color Constancy via Vision-Language Evaluation
DreamSR: Towards Ultra-High-Resolution Image Super-Resolution via a Receptive-Field Enhanced Diffusion Transformer
Leveraging Multispectral Sensors for Color Correction in Mobile Cameras
Streamlined Open-Vocabulary Human-Object Interaction Detection
Explicit Recovery Behavior for Diffusion Policies
PercHead: Perceptual Head Model for Single-Image 3D Head Reconstruction & Editing
Agentic Retoucher for Text-To-Image Generation
Adaptive Capacity Autoregressive Visual Tracking
GuardTrace-VL: Detecting Unsafe Multimodal Reasoning via Iterative Safety Supervision
Progress by Pieces: Test-Time Scaling for Autoregressive Image Generation
Multi-speaker Attention Alignment for Multimodal Social Interaction
ALLNet: Multi-task Dense Prediction for Degraded Images
Velox: Learning Representations of 4D Geometry and Appearance
DLWM: Dual Latent World Models enable Holistic Gaussian-centric Pre-training in Autonomous Driving
Infinity-RoPE: Action-Controllable Infinite Video Generation Emerges From Autoregressive Self-Rollout
One-Step Diffusion Transformer for Controllable Real-World Image Super-Resolution
RobotSeg: A Model and Dataset for Segmenting Robots in Image and Video
Talk2Move: Reinforcement Learning for Text-Instructed Object-Level Geometric Transformation in Scenes
Differentiable Vector Quantization for Rate-Distortion Optimization of Generative Image Compression
Chain-of-Frames: Advancing Video Understanding in Multimodal LLMs via Frame-Aware Reasoning
Multi-Scale Gradient-Guided Unrolling Architecture with Adaptive Mamba for Compressive Sensing
DVGT: Visual Geometry Transformer for Autonomous Driving
VerseCrafter: Dynamic Realistic Video World Model with 4D Geometric Control
Beyond Objects: Contextual Synthetic Data Generation for Fine-Grained Classification
Mind the Generative Details: Direct Localized Detail Preference Optimization for Video Diffusion Models
Efficient All-Pairs Correlation Volume Sampling for Optical Flow Estimation
Graph Attention Prototypical Network for Robust Few-Shot Classification
Gaussian-Mixture Latent Flow for Stochastic 3D Human Motion Prediction
Convexity-Aware Noise Calibration: A Self-Supervised Framework for Noise-Level-Unknown Image Denoising
FG-portrait: 3D Flow Guided Editable Portrait Animation
TVHighlights: LLM-Guided Human-Free Collaborative Training for Video Highlight Detection in Movies and TV Dramas
BiGain: Unified Token Compression for Joint Generation and Classification
Think Visually, Reason Textually: Vision-Language Synergy in Abstract Reasoning
QUANTIPHY: A Quantitative Benchmark Evaluating Physical Reasoning Abilities of Vision-Language Models
Diagnosing and Repairing Unsafe Channels in Vision-Language Models via Causal Discovery and Dual-Modal Safety Subspace Projection
Otil: Accelerating Diffusion Model Inference via Communication-Efficient Multi-GPU Parallelism
SG-LoRA: Semantic-guided LoRA Parameters Generation
RAG-TP: A General Framework for Vehicle Trajectory Prediction via Retrieval-Augmented Generation
Thinking with Drafts: Speculative Temporal Reasoning for Efficient Long-Video Understanding
Local Precise Refinement: A Dual-Gated Mixture-of-Experts for Enhancing Foundation Model Generalization against Spectral Shifts
Mining Attribute Subspaces for Efficient Fine-tuning of 3D Foundation Models
SimplePoster: A Simple Baseline for Product Poster Generation
Mind the Discriminability Trap in Source-Free Cross-domain Few-shot Learning
MuViT: Multi-Resolution Vision Transformers for Learning Across Scales in Microscopy
Interactive Episodic Memory with User Feedback
Alternative Reprogramming for Service Models
VGA-Bench: A Unified Benchmark and Multi-Model Framework for Video Aesthetics and Generation Quality Evaluation
A Polynomial Chaos Framework for Causal Discovery in Nonlinear Uncertain Systems
SAVE: Speech-Aware Video Representation Learning for Video-Text Retrieval
Learning What Matters: Prioritized Concept Learning via Relative Error-driven Sample Selection
No Need For Real Anomaly: MLLM Empowered Zero-Shot Video Anomaly Detection
ProFocus: Proactive Perception and Focused Reasoning in Vision-and-Language Navigation
Envisioning the Future, One Step at a Time
From Softmax to Dirichlet: Evidential Learning for Semi-supervised Semantic Segmentation
Face-Guided Sentiment Boundary Enhancement for Weakly-Supervised Temporal Sentiment Localization
VoxTell: Free-Text Promptable Universal 3D Medical Image Segmentation
TrafficAlign: Aligning Large Language Models for Traffic Scenario Generation
MRI Contrast Enhancement Kinetics World Model
HiDRA: Hierarchical Degradation Representation and Adaptation with Generative Priors for Enhancing Infrared Vision
Global-Aware Edge Prioritization for Pose Graph Initialization
A Geometric Algebra-Informed 3DGS Framework for Wireless Channel Prediction
OmniLottie: Generating Vector Animations via Parameterized Lottie Tokens
MIBURI: Towards Expressive Interactive Gesture Synthesis
UniCorn: Unified Correspondence Transformer Across 2D and 3D
Stealing Split Learning Bottom Models by Recovering Embedding Geometry
Masked Auto-Regressive Variational Acceleration: Fast Inference Makes Practical Reinforcement Learning
Divide and Conquer: Object Co-occurrence Helps Mitigate Simplicity Bias in OOD Detection
Latent Diffusion Inversion Requires Understanding the Latent Space
Captain Safari: A Real-time World Engine
AXG-Reasoner: Error Detection and Explanation in Long Task Videos with Vision–Language Models
MeshFlow: Efficient Artistic Mesh Generation via MeshVAE and Flow-based DiTs
Lenses: Toward Polysemous Vision–Language Understanding
ArchSym: Detecting 3D-Grounded Architectural Symmetries in the Wild
MCHDoc: A Comprehensive Benchmark for Reading Multi-Carrier Chinese Historical Documents
GeoViS: Geospatially Rewarded Visual Search for Remote Sensing Visual Grounding
Sculpt4D: Generating 4D Shapes via Sparse-Attention Diffusion Transformers
MoDES: Accelerating Mixture-of-Experts Multimodal Large Language Models via Dynamic Expert Skipping
Gyro-based Deep Video Deblurring
Global Structure-from-Motion Meets Feedforward Reconstruction
Every Error has Its Magnitude: Asymmetric Mistake Severity Training for Multiclass Multiple Instance Learning
OccAny: Generalized Unconstrained Urban 3D Occupancy
FlashIn: Fast and Accurate Image Inversion for Real-time Image Editing
Premier: Personalized Preference Modulation with Learnable User Embedding in Text-to-Image Generation
MARSS: Radar Semantic Segmentation via Modular Attention and State Space Models
Guiding Token-Sparse Diffusion Models
Divide, then Ground: Adapting Frame Selection to Query Types for Long-Form Video Understanding
Learning Long-term Motion Embeddings for Efficient Kinematics Generation
CrossAgent: Bridging Cross-level Actions into One Agentic Model via Reinforcement Learning
Is Bin Generation Indispensable? A Bin-Generation-Free Dataset Quantization via Semantic Perspective
Revisiting F-measure Optimization in Multi-Label Classification: A Sampling-based Approach
VDE: Training-Free Accelerating Rectified Flow Model via Velocity Decomposition and Estimation
Evo-Retriever: LLM-Guided Curriculum Evolution with Viewpoint-Pathway Collaboration for Multimodal Document Retrieval
VGA: Empowering Aerial-Ground Localization by Visual Geometry Alignment
Towards Motion Turing Test: Evaluating Human-Likeness in Humanoid Robots
CausalLens: Sensitivity-Guided Multi-Head Causal Intervention for Hallucination Mitigation in Large Vision-Language Models
Forecast the Principal, Stabilize the Residual: Subspace-Aware Feature Caching for Diffusion Transformers
SAT-RRG: LLM-Guided Self-Adaptive Training for Radiology Report Generation with Token-Level Push–Pull Optimization
Beyond Generation: Advancing Image Editing Priors for Depth and Normal Estimation
FabricGen: Microstructure-Aware Woven Fabric Generation
Reliable Clustering Number Estimation for Contrastive Multi-View Clustering
Lighting in Motion: Spatiotemporal HDR Lighting Estimation
ORION: ORthonormal Text Encoding for Universal VLM AdaptatION
EReCu: Pseudo-label Evolution Fusion and Refinement with Multi-Cue Learning for Unsupervised Camouflage Detection
PixelDiT: Pixel Diffusion Transformers for Image Generation
ResCa: Residual Caching for Diffusion Transformers Acceleration
ClipGStream: Clip-Stream Gaussian Splatting for Any Length and Any Motion Multi-View Dynamic Scene Reconstruction
Camouflage-aware Image-Text Retrieval via Expert Collaboration
RS-SSM: Refining Forgotten Specifics in State Space Model for Video Semantic Segmentation
MARCO: Navigating the Unseen Space of Semantic Correspondence
A Unified Framework for Knowledge Transfer in Bidirectional Model Scaling
Turning Pre-Trained Vision Transformers into End-to-End Histopathology Whole Slide Image Models for Survival Prediction
Can You Learn to See Without Images? Procedural Warm-Up for Vision Transformers
Fast Reasoning Segmentation for Images and Videos
Agent4FaceForgery: Multi-Agent LLM Framework for Realistic Face Forgery Detection
SAIL: Similarity-Aware Guidance and Inter-Caption Augmentation-based Learning for Weakly-Supervised Dense Video Captioning
PhysIR-Splat: Physically Consistent Thermal Infrared Radiative Transfer in 3D Gaussian Splatting
Few-Shot Incremental 3D Object Detection in Dynamic Indoor Environments
Convolutional Neural Networks Driven by Content Similarity
TGSFormer: Scalable Temporal Gaussian Splatting for Embodied Semantic Scene Completion
CaptionQA: Is Your Caption as Useful as the Image Itself?
GPFlow: Gaussian Prototype Probability Flow for Unsupervised Multi-Modal Anomaly Detection
MoECLIP: Patch-Specialized Experts for Zero-shot Anomaly Detection
SODA: Sensitivity-Oriented Dynamic Acceleration for Diffusion Transformer
A combination of noise and bilateral filters achieve supralinear and scalable adversarial robustness in CNNs
CrowdGaussian: Reconstructing High-Fidelity 3D Gaussians for Human Crowd from a Single Image
Pantheon360: Taming Digital Twin Generation via 3D-Aware 360° Video Diffusion
Dual Graph Regularized Deep Unfolding Network for Guided Depth Map Super-resolution
Latent Visual Reasoning
Reward Forcing: Efficient Streaming Video Generation with Rewarded Distribution Matching Distillation
PerpetualWonder: Long-horizon Action-conditioned 4D Scene Generation
TeamHOI: Learning a Unified Policy for Cooperative Human-Object Interactions with Any Team Size
High-Fidelity Generation of Lane Scenes under Adverse Weather and Lighting Conditions without Re-annotation
PhyOceanCast: Global Ocean Forecasting with Physics-Informed Diffusion
AdaSVD: Singular Value Decomposition with Adaptive Mechanisms for Large Multimodal Models
ActiveGrasp: Information-Guided Active Grasping with Calibrated Energy-based Model
Animator-Centric Skeleton Generation on Objects with Fine-Grained Details
HandWorld: Hand-Centric Unified Video Action Generation
BOP-ASK: Object-Interaction Reasoning for Vision-Language Models
MambaSIC: Mamba-based Stereo Image Compression with Bi-directional Multi-reference Entropy Model
Object-WIPER: Training-Free Object and Associated Effect Removal in Videos
Scenes as Tokens: Multi-Scale Normal Distributions Transform Tokenizer for General 3D Vision–Language Understanding
TUDSR: Twice Upsampling-Diffusion for Higher Super-Resolution
CLEX: Complementary Label Exchange Learning for Noisy Facial Expression Recognition
LitePT: Lighter Yet Stronger Point Transformer
Generalizing Visual Geometry Priors to Sparse Gaussian Occupancy Prediction
MangoBench: A Benchmark for Multi-Agent Goal-Conditioned Offline Reinforcement Learning
FocusUI: Efficient UI Grounding via Position-Preserving Visual Token Selection
Learning from Oblivion: Predicting Knowledge-Overflowed Weights via Retrodiction of Forgetting
Adaptive Anisotropic Gaussian Splatting for Multi-contrast MRI Arbitrary-Scale Super-Resolution with Anatomy Guidance
LeapAlign: Post-training Flow Matching Models at Any Generation Step by Building Two-Step Trajectories
AeroDGS: Physically Consistent Dynamic Gaussian Splatting for Single-Sequence Aerial 4D Reconstruction
MS-Temba: Multi-Scale Temporal Mamba for Understanding Long Untrimmed Videos
Vision-Oriented Lightweight Neural Architecture Search with Budget-Adaptive Evaluation
FVGen: Scaling 3D Scene Datasets with Certainty-Aware Free-View Generation from Scene Geometry Reconstruction
ReScene4D: Temporally Consistent Semantic Instance Segmentation of Evolving Indoor 3D Scenes
DextER: Language-driven Dexterous Grasp Generation with Embodied Reasoning
MakeAnything: Harnessing Diffusion Transformers for Multi-Domain Procedural Sequence Generation
D2Dewarp: Dual Dimensions Geometric Representation Learning Based Document Image Dewarping
From Indoor to Open World: Revealing the Spatial Reasoning Gap in MLLMs
Beyond Euclidean Gossip: KL-Barycentric Consensus on Heterogeneous and Imbalanced Images
Coded-E2LF: Coded Aperture Light Field Imaging from Events
Mimic Human Cognition, Master Multi-Image Reasoning: A Meta-Action Framework for Enhanced Visual Understanding
Learning to Generate Highly Dynamic Videos using Synthetic Motion Data
ShowTable: Unlocking Creative Table Visualization with Collaborative Reflection and Refinement
R4Det: 4D Radar-Camera Fusion for High-Performance 3D Object Detection
A Provable Energy-Guided Test-Time Defense Boosting Adversarial Robustness of Large Vision-Language Models
Color When It Counts: Grayscale-Guided Online Triggering for Always-On Streaming Video Sensing
From Corners to Fiducial Tags: Revisiting Checkerboard Calibration for Event Cameras
Cross-View Distillation and Adaptive Masking for Incomplete Multi-View Multi-Label Classification
StyleDoctor: Towards Specialist Reward Model for Style-centric Generation Tasks
Bézier Degradation Modeling for LiDAR-based Human Motion Capture
See Less, See Right: Bi-directional Perceptual Shaping For Multimodal Reasoning
MVInverse: Feed-forward Multi-view Inverse Rendering in Seconds
Can We Build Scene Graphs, Not Classify Them? FlowSG: Progressive Image-Conditioned Scene Graph Generation with Flow Matching
JUMP-Hand: Learning Joint-wise Uncertainty to Gate Mixture of View Experts for Multi-View 3D Hand Reconstruction
Multi-Patch Global-to-Local Transformer Architecture For Efficient Flow Matching and Diffusion Model
Motion-Aware Animatable Gaussian Avatars Deblurring
3D-Fixer: Coarse-to-Fine In-place Completion for 3D Scenes from a Single Image
Recovering Physically Plausible Human-Object Interactions from Monocular Videos
FORCE: Transferable Visual Jailbreaking Attacks via Feature Over-Reliance CorrEction
Bias at the End of the Score
Act2See: Emergent Active Visual Perception for Video Reasoning
Decompose, Mix, Adapt: A Unified Framework for Parameter-Efficient Neural Network Recombination and Compression
Beyond Binary Contrast: Modeling Continuous Skeleton Action Spaces with Transitional Anchors
SE(3)-Equivariance with Geometric and Topological Guidance for Category-Level Object Pose Estimation
StructXLIP: Enhancing Vision-language Models with Multimodal Structural Cues
SVHalluc: Benchmarking Speech–Vision Hallucination in Audio-Visual Large Language Models
Designing Instance-Level Sampling Schedules via REINFORCE with James-Stein Shrinkage
ShreddingNet: Coarse-to-Fine Restoration for Multi-Source Shredded Manuscripts
Parse, Search, and Confirmation: Training-Free Aerial Vision-and-Dialog Navigation with Chain-of-Thought Reasoning and Structured Spatial Memory
Large-scale Robust Enhanced Ensemble Clustering via Outlier Decoupling
First Logit Boosting: Visual Grounding Method to Mitigate Object Hallucination in Large Vision-Language Models
From 2D Alignment to 3D Plausibility: Unifying Heterogeneous 2D Priors and Penetration-Free Diffusion for Occlusion-Robust Two-Hand Reconstruction
Learning Compact 3D Representations from Feed-Forward Novel View Synthesis
Imbalanced View Contribution Evaluation and Refinement for Deep Incomplete Multi-View Clustering
Cubic Discrete Diffusion: Discrete Visual Generation on High-Dimensional Representation Tokens
InstantViR: Real-Time Video Inverse Problem Solver with Distilled Diffusion Prior
GeodesicNVS: Probability Density Geodesic Flow Matching for Novel View Synthesis
Self-supervised Dynamic Heterogeneous Degradation Modeling for Unified Zero-Shot Image Restoration
Ov3R: Open-Vocabulary Semantic 3D Reconstruction from RGB Videos
Beyond Top Activations: Efficient and Reliable Crowdsourced Evaluation of Automated Interpretability
Anchoring and Rescaling Attention for Semantically Coherent Inbetweening
Latent Chain-of-Thought World Modeling for End-to-End Driving
Unified Personalized Understanding, Generating and Editing
FoundIR-v2: Optimizing Pre-Training Data Mixtures for Image Restoration Foundation Model
Feed-forward Gaussian Registration for Head Avatar Creation and Editing
EV-CGNet: Co-visible Focused 3D-guided 2D Event Keypoint Detection Network
SAGE: Style-Adaptive Generalization for Privacy-Constrained Semantic Segmentation Across Domains
MMSD3.0: A Multi-Image Benchmark for Real-World Multimodal Sarcasm Detection
Multimodal Distribution Matching for Vision-Language Dataset Distillation
PhyGaP: Physically-Grounded Gaussians with Polarization Cues
VDOT: Efficient Unified Video Creation via Optimal Transport Distillation
Coupling Liquid Time-Constant Encoders with Modern Hopfield Memory
EgoProx: Evaluating MLLMs on Egocentric 3D Proximity Reasoning Across a Cognitive Hierarchy
Dual Band Video Thermography: Separating Time-Varying Reflection and Emission Near Ambient Conditions
Models as Lego Builders: Assembling Malice from Benign Blocks via Semantic Blueprints
3D-Aware Multi-Task Learning with Cross-View Correlations for Dense Scene Understanding
OmniSonic: Towards Universal and Holistic Audio Generation from Video and Text
BulletTime: Decoupled Control of Time and Camera Pose for Video Generation
RehearseVLA: Simulated Post-Training for VLAs with Physically-Consistent World Model
ManifoldGD: Training-Free Hierarchical Manifold Guidance for Diffusion-Based Dataset Distillation
Hilbert-Geo: Solving Solid Geometric Problems by Neural-Symbolic Reasoning
Open-Vocabulary Domain Generalization in Urban-Scene Segmentation
BDNet: Bio-Inspired Dual-Backbone Small Object Detection Network
CogniVerse: Revolutionizing Multi-modal Retrieval-Augmented Generation with Cognitive Reflection and Geometric Reasoning
DSERT-RoLL: Robust Multi-Modal Perception for Diverse Driving Conditions with Stereo Event-RGB-Thermal Cameras, 4D Radar, and Dual-LiDAR
Parallelised Differentiable Straightest Geodesics for 3D Meshes
Learning to Control Physically-simulated 3D Characters via Generating and Mimicking 2D Motions
LazyVAR: Accelerating Visual Autoregressive Models via Scale-wise Token Pruning and Parallel Group Decoding
Chain-of-Thought Guided Multi-Modal Object Re-Identification
Enhancing Video VLM with Visual-Audio Supersensing
PixelRush: Ultra-Fast, Training-Free High-Resolution Image Generation via One-step Diffusion
PosterOmni: Generalized Artistic Poster Creation via Task Distillation and Unified Reward Feedback
Long-Tail Internet Photo Reconstruction
Breaking the Regional Perception Bottleneck of Multimodal Large Language Models via External Reasoning Framework
4DEquine: Disentangling Motion and Appearance for 4D Equine Reconstruction from Monocular Video
Improved Mean Flows: On the Challenges of Fastforward Generative Models
PersonaLive! Expressive Portrait Image Animation for Live Streaming
CLIP-like Model as a Foundational Density Ratio Estimator
SubspaceAD: Training-Free Few-Shot Anomaly Detection via Subspace Modeling
Semi-supervised Echocardiography Video Segmentation via Anchor Semantic Awareness and Continuous Pseudo-label Reforging
PS-SR: Pseudo-Single-Step Video Super-Resolution via Speculative Diffusion
DynFusion: Rethinking Condition Fusion for Adaptive Multi-condition Text-to-Image Generation
Learn to Learn Weight Generation via Local Consistency Diffusion
PEARL: Geometry Aligns Semantics for Training-Free Open-Vocabulary Semantic Segmentation
PG-VTON: Single-Pass Training-Free Virtual Try-On via Patch-Guided Reference Alignment
VarSplat: Uncertainty-aware 3D Gaussian Splatting for Robust RGB-D SLAM
STAC: Plug-and-Play Spatio-Temporal Aware Cache Compression for Streaming 3D Reconstruction
DABO: Difficulty-Aware Bayesian Optimization with Diffusion-Learned Priors
VIRST: Video-Instructed Reasoning assistant for SpatioTemporal Segmentation
ConsistCompose: Unified Multimodal Layout Control for Image Composition
FHAvatar: Fast and High-Fidelity Reconstruction of Face-and-Hair Composable 3D Head Avatar from Few Casual Captures
CrossHOI: Learning Cross-View Representations for Monocular 3D Human-Object Interaction Reconstruction
Lightmover: Towards Precise and Efficient Control for Light Movement
SATTC: Structure-Aware Label-Free Test-Time Calibration for Cross-Subject EEG-to-Image Retrieval
Multi-Prototype Compactness and Boundary-Aware Synthesis for Unsupervised Anomaly Detection
ChimeraLoRA: Multi-Head LoRA-Guided Synthetic Datasets
ShiftLUT: Spatial Shift Enhanced Look-Up Tables for Efficient Image Restoration
Homaloidal parametrization for detecting critical two-view configurations
FontCrafter: High-Fidelity Element-Driven Artistic Font Creation with Visual In-Context Generation
Depth Hypothesis Guided Iterative Refinement for Event–Image Monocular Depth Estimation
JANUS: A Lightweight Framework for Jailbreaking Text-to-Image Models via Distribution Optimization
GTR-Turbo: Merged Checkpoint is Secretly a Free Teacher for Agentic VLM Training
FusionAgent: A Multimodal Agent with Dynamic Model Selection for Human Recognition
GROW: Watermark Generation with Progressive Guidance for Diffusion Models
Differences That Matter: Auditing Models for Capability Gap Discovery and Rectification
Learning Like Humans: Analogical Concept Learning for Generalized Category Discovery
Towards Unified Human Perception and Machine Understanding: Token Flow Guided Compression Framework
GaussianDWM: Driving World Model using Language-aligned 3D Gaussians for Scene Understanding and Multi-modal Generation
StyleGallery: Training-free and Semantic-aware Personalized Style Transfer from Arbitrary Image References
Gated KalmaNet: A fading memory layer through test-time ridge regression
Decoupling Stability and Plasticity for Multi-Modal Test-Time Adaptation
Dr. Seg: Revisiting GRPO Training for Visual Large Language Models through Perception-Oriented Design
Adaptive Video Distillation: Mitigating Oversaturation and Temporal Collapse in Few-Step Generation
PRISM: Prototype-based Reasoning with Inter-modal Semantic Mining for Interpretable Image Recognition
ARMFlow: AutoRegressive MeanFlow for Online 3D Human Reaction Generation
RobustVisRAG: Causality-Aware Vision-Based Retrieval-Augmented Generation under Visual Degradations
Beyond Matching to Tiles: Bridging Unaligned Aerial and Satellite Views for Vision-Only UAV Navigation
GeCo: Geometry-Consistent Regularization for Domain Generalized Semantic Segmentation
Globscope: Toward a Global View of the Loss Landscape
Dual-level Adapter Boosting Prompt-free Curvilinear Structure Segmentation
GOR-IS: 3D Gaussian Object Removal In the Intrinsic Space
LAM: Language Articulated Object Modelers
ART: Articulated Reconstruction Transformer
Multi-SpatialMLLM: Multi-Frame Spatial Understanding with Multi-Modal Large Language Models
MMVIP: A Visible-infrared Paired Dataset for Multi-weather Marine Vision
The Devil Is in Gradient Entanglement: Energy-Aware Gradient Coordinator for Robust Generalized Category Discovery
MatSpray: Fusing 2D Material World Knowledge on 3D Geometry
FlexiVideo: Variation-Aware Temporal Dynamics Modeling for Efficient Video Understanding
Improving Text-to-Image Generation with Intrinsic Self-Confidence Rewards
EgoXtreme: A Dataset for Robust Object Pose Estimation in Egocentric Views under Extreme Conditions
Prompt-Anchored Vision–Text Distillation for Lifelong Person Re-identification
Spatial Matters: Position-Guided 3D Referring Expression Segmentation
CADC: Content Adaptive Diffusion-Based Generative Image Compression
Smart Replay: Adaptive Scheduling of Memory Rehearsal for Computational Resource-Aware Incremental Learning
FINER: MLLMs Hallucinate under Fine-grained Negative Queries
GFRRN: Explore the Gaps in Single Image Reflection Removal
Bridging Fidelity-Reality with Controllable One-Step Diffusion for Image Super-Resolution
Dynamic Label Noise Suppression with Optimal Teacher Pool for Facial Expression Recognition
Breaking Semantic Boundaries: Distribution-Guided Semantic Exploration for Creative Generation
DetAny4D: Detect Anything 4D Temporally in a Streaming RGB Video
TESO: Online Tracking of Essential Matrix by Stochastic Optimization
Conditional Factuality Controlled LLMs with Generalization Certificates via Conformal Sampling
UniPart: Part-Level 3D Generation with Unified 3D Geom–Seg Latents
Reward Sharpness-Aware Fine-Tuning for Diffusion Models
Towards Human-Like Robot Handwriting via Contour-Aware Generation
Quantized Residuals to Continuous Prompts for Few-Shot Class Incremental Learning in Vision-Language Models
BiTAgent: A Task-Aware Modular Framework for Bidirectional Coupling between Multimodal Large Language Models and World Models
NeighborMAE: Exploiting Spatial Dependencies between Neighboring Earth Observation Images in Masked Autoencoders Pretraining
Demo2Tutorial: From Human Experience to Multimodal Software Tutorials
Fine-Grained GRPO for Precise Preference Alignment in Flow Models
TopoSlide - Topologically-Informed Histopathology Whole Slide Image Representation Learning
GenMatter: Perceiving Physical Objects with Generative Matter Models
Reinforcing Video Object Segmentation to Think before it Segments
Continual Distillation of Teachers from Different Domains
VLIC: Vision-Language Models As Perceptual Judges for Human-Aligned Image Compression
SpatialTree: How Spatial Intelligence Branches Out in MLLMs
UltraFlux: Data-Model Co-Design for High-quality Native 4K Text-to-Image Generation across Diverse Aspect Ratios
SMVRT: Implicit Human 3D Modeling Using Sparse Multi-view Volumetric Reconstruction with Transformer Fusion
NIL: No-data Imitation Learning
MA-Bench: Towards Fine-grained Micro-Action Understanding
A Frame is Worth One Token: Efficient Generative World Modeling with Delta Tokens
TempoMaster: Efficient Long Video Generation via Next-Frame-Rate Prediction
VideoITG: Multimodal Video Understanding with Instructed Temporal Grounding
EMMA: Concept Erasure Benchmark with Comprehensive Semantic Metrics and Diverse Categories
QueryOcc: Query-based Self-Supervision for 3D Semantic Occupancy
FILTR: Extracting Topological Features from Pretrained 3D Models
Iris: Integrating Language into Diffusion-based Monocular Depth Estimation
NEAF: Natural Image Editing with Attention Fusion for Generalizable Tuning-Free Text-Guided Image Editing
MarkushGrapher-2: End-to-end Multimodal Recognition of Chemical Structures
Monocular Open Vocabulary Occupancy Prediction for Indoor Scenes
GraspLDP: Towards Generalizable Grasping Policy via Latent Diffusion
PointWorld: Scaling 3D World Models for In-The-Wild Robotic Manipulation
EDGS: Eliminating Densification for Efficient Convergence of 3DGS
Diffusion-Based Native Adversarial Synthesis for Enhanced Medical Segmentation Generalization
FunFact: Building Probabilistic Functional 3D Scene Graphs via Factor-Graph Reasoning
Boosting Reasoning in Large Multimodal Models via Activation Replay
FusionRegister: Every Infrared and Visible Image Fusion Deserves Registration
FlashPortrait: 6$\times$ Faster Infinite Portrait Animation with Adaptive Latent Prediction
HieraMamba: Video Temporal Grounding via Hierarchical Anchor-Mamba Pooling
Decoupled and Reusable Adaptation for Efficient Cross-Modal Transfer
CLEP: Contrastive Language-Pose Pretraining
Choreographing a World of Dynamic Objects
MaskAdapt: Learning Flexible Motion Adaptation via Mask-Invariant Prior for Physics-Based Characters
GeneVAR: Causal MeanFlow for Autoregressive Gene-to-WSI Tile Synthesis
LLaMo: Scaling Pretrained Language Models for Unified Motion Understanding and Generation with Continuous Autoregressive Tokens
Unsafe2Safe: Controllable Image Anonymization for Downstream Utility
DiffDecompose: Layer-Wise Decomposition of Alpha-Composited Images via Diffusion Transformers
MindPower: Enabling Theory-of-Mind Reasoning in VLM-based Embodied Agents
Planning in 8 Tokens: A Compact Discrete Tokenizer for Latent World Model
AdaSpot: Spend Resolution Where It Matters for Precise Event Spotting
Cross-Architecture Adaptation: Cloud-Edge Continual Test-Time Adaptation with Dynamic Sampling and Heterogeneous Distillation
LaVR: Latent Space Conditioned Video Re-rendering using Large 4D Reconstruction Models
Task-Driven Implicit Representations for Automated Design of LiDAR Systems
DeepAlign: Mitigating Modality Conflict through Modality-Specific Alignment
LumiX: Structured and Coherent Text-to-Intrinsic Generation
VL-RouterBench: A Benchmark for Vision–Language Model Routing
MV3DIS: Multi-View Mask Matching via 3D Guides for Zero-Shot 3D Instance Segmentation
Agentic Video Summarization via Self-Reflecting Multimodal Understanding
Resolving Endpoint Underfitting in Diffusion Bridges via Noise Alignment
Event-based Visual Deformation Measurement
Gaussian Splatting-based Low-Rank Tensor Representation for Multi-Dimensional Image Recovery
Wanderland: Geometrically Grounded Simulation for Open-World Embodied AI
Divide, Conquer, and Aggregate: Asymmetric Experts for Class-Imbalanced Semi-Supervised Medical Image Segmentation
Image-to-Point Cloud Feature Back-projection for Multimodal Training of 3D Semantic Segmentation
Pico-Banana-400K: A Large-Scale Dataset for Text-Guided Image Editing
Visual-Aware CoT: Achieving High-Fidelity Visual Consistency in Unified Models
UniPR: Unified Object-level Real-to-Sim Perception and Reconstruction from a Single Stereo Pair
CLaD: Planning with Grounded Foresight via Cross-Modal Latent Dynamics
OralGPT-Omni: A Versatile Dental Multimodal Large Language Model
Unlocking 3D Affordance Segmentation with 2D Semantic Knowledge
Semantic Scale Space: A Framework for Controllable Image Abstraction
VidPrism: Heterogeneous Mixture of Experts for Image-to-Video Transfer
PatchScene: Patch-based Voxel Diffusion Model for Large-Scale Scene Completion
EchoVDiff: Cardiac-Cycle Echocardiography Video Generation from Arbitrary Frame
Rethinking Concept Bottleneck Models: From Pitfalls to Solutions
Label-Free Cross-Task LoRA Merging with Null-Space Compression
Attention-aware Inference Optimizations for Large Vision-Language Models with Memory-efficient Decoding
Exploring 6D Object Pose Estimation with Deformation
SIGMA: Selective-Interleaved Generation with Multi-Attribute Tokens
MatchMask: Mask-Centric Generative Data Augmentation for Label-Scarce Semantic Segmentation
Stabilizing Streaming Video Geometry via Dynamic Feature Normalization
Resolving the Identity Crisis in Text-to-Image Generation
Authorize-on-Demand: Dynamic Authorization with Legality-Aware Intellectual Property Protection for VLMs
Your Latent Mask is Wrong: Pixel-Equivalent Latent Compositing for Diffusion Models
Hierarchically Robust Zero-shot Vision-Language Models
Revisiting Multimodal KV Cache Compression: A Frequency-Domain-Guided Outlier-KV-Aware Approach
Linking Modality Isolation in Heterogeneous Collaborative Perception
Global Information Thresholding for Sufficient and Necessary Circuits
Socratic-Geo: Synthetic Data Generation and Cross-Modal Geometric Reasoning via Multi-Agent Interaction
Any Resolution Any Geometry: From Multi-View To Multi-Patch
Cross-View Splatter: Feed-Forward View Synthesis with Georeferenced Images
CRFT: Consistent–Recurrent Feature Flow Transformer for Cross-Modal Image Registration
EventGait: Towards Robust Gait Recognition with Event Streams
Semantic Audio-Visual Navigation in Continuous Environments
ARCache: Mitigating Error Accumulation for Caching-based Acceleration in Autoregressive Video Diffusion Models
TRANSPORTER: Transferring Visual Semantics from VLM Manifolds
Uni-Hema: Unified Model for Digital Hematopathology
Coupled Diffusion Sampling for Training-free Multi-view Image Editing
Seeing through Light and Darkness: Sensor-Physics Grounded Deblurring HDR NeRF from Single-Exposure Images and Events
AERGS-SLAM: Auto-Exposure-Robust Stereo 3D Gaussian Splatting SLAM
Efficiency Follows Global-Local Decoupling
Efficient and Training-Free Single-Image Diffusion Models
HOLO: Homography-Guided Pose Estimator Network for Fine-Grained Visual Localization on SD Maps
Regulating Rather than Constraining: Adaptive Guidance for Complex Spectral Reconstruction in Pansharpening
Video-as-Answer: Predict and Generate Next Video Event with Joint-GRPO
Human Geometry Distribution for 3D Animation Generation
Qwen-Image-Layered: Towards Inherent Editability via Layer Decomposition
DreamingComics: A Story Visualization Pipeline via Subject and Layout Customized Generation using Video Models
AnimaMimic: Imitating 3D Animation from Video Priors
Unsupervised 3D Motion Estimation Using Event Camera
MoVie: Broaden Your Views with Human Motion for Action Detection
Align Images Before You Generate
Generalized-CVO: Fast and Correspondence-Free Point Cloud Registration in RKHS with Second Order Riemannian Optimization
AdaDexTrack: Dynamic Modulation for Adaptive and Generalizable Dexterous Manipulation Tracking
Dynamic Momentum Recalibration in Online Gradient Learning
FUSER: Feed-Forward Multiview 3D Registration Transformer and SE(3)$^N$ Diffusion Refinement
Image Generation from Contextually-Contradictory Prompts
Interpretable Motion-Attentive Maps: Spatio-Temporally Localizing Concepts in Video Diffusion Transformers
Dynamic Important Example Mining for Reinforcement Finetuning
BuildingGPT: Auto-Regressive Building Wireframe Reconstruction Model with Reinforcement Learning
VES-RFT: Rewarding Visual Evidence Sensitivity to Mitigate Hallucinations in Large Vision–Language Models
Wavelet-based Frame Selection by Detecting Semantic Boundary for Long Video Understanding
Information-Theoretic Decomposition for Multimodal Interaction Learning
Towards Persistence: Learning Topological Constraints for Event-based Small Object Detection
Balanced Dataset Distillation via Modeling Multiple Visual Pattern Distribution
Gated Condition Injection without Multimodal Attention: Towards Controllable Linear-Attention Transformers
PrITTI: Primitive-based Generation of Controllable and Editable 3D Semantic Urban Scenes
Generative Diffusion Priors for 3D Mapping of the Dark Universe
Spike-driven Discrete Aggregation for Event-based Object Detection
Deformable Gaussian Occupancy: Decoupling Rigid and Nonrigid Motion with Factorized Distillation
MedFG-VQA: Low-Frequency Memory and Graph Attention for Lightweight Medical VQA
PromptStereo: Zero-Shot Stereo Matching via Structure and Motion Prompts
Bridging Pixels and Words: Mask-Aware Local Semantic Fusion for Multimodal Media Verification
NavForesee: A Unified Vision-Language World Model for Hierarchical Planning and Dual-Horizon Navigation Prediction
Global Underwater Geolocation from Time-Lapse Polarization Imagery
Downscaling Intelligence: Exploring Perception and Reasoning Bottlenecks in Small Multimodal Models
Controllable Stereo Video Conversion with Guided Latent Decoding
Geometry-Guided 3D Visual Token Pruning for Video-Language Models
Anatomica: Localized Control over Geometric and Topological Properties for Anatomical Diffusion Models
Taming Preference Mode Collapse via Directional Decoupling Alignment in Diffusion Reinforcement Learning
Depth Peeling for High-Fidelity Gaussian-Enhanced Surfel Rendering
Self-guided Semantic Inspection for Zero-Shot Composed Image Retrieval
Multimodal Semantic Bias Mitigation for Diverse Text-To-3D Generation
RT-Splatting: Joint Reflection-Transmission Modeling with Gaussian Splatting
ScenDi: 3D-to-2D Scene Diffusion Cascades for Urban Generation
Learning a Unified Latent Action Space from Videos with Action-centric Cycle Consistency
Cross-Hand Latent Representation for Vision-Language-Action models
R2G: A Multi-View Circuit Graph Benchmark Suite from RTL to GDSII
LS-ViT: Least-Squares Hessian Based Block Reconstruction for Low-Bit Post-Training Quantization of Vision Transformers
Towards Open-Vocabulary Industrial Defect Understanding with a Large-Scale Multimodal Dataset
BackSplit: The Importance of Sub-dividing the Background in Biomedical Lesion Segmentation
ShadowDraw: From Any Object to Shadow–Drawing Compositional Art
Ani3DHuman: Photorealistic 3D Human Animation with Self-guided Stochastic Sampling
Lifelong Imitation Learning with Multimodal Latent Replay and Incremental Adjustment
PnP-CM: Consistency Models as Plug-and-Play Priors for Inverse Problems
mmWaveFlow: Unified Enhancement and Generation of mmWave Human Point Clouds
RebRL: Reinforcing Discrete Visual Diffusion Models with Rebalanced Timestep Credits
AnyLift: Scaling Motion Reconstruction from Internet Videos via 2D Diffusion
Landscape-Awareness for Geometric View Diffusion Model
DGS: Dual Gradient and Semantic-Shift Guided Low-Rank Adaptation for Class Incremental Learning
VQ-VA World: Towards High-Quality Visual Question-Visual Answering
Curvature-Aware Captioning: Leveraging Geodesic Attention for 3D Scene Understanding
Spatia: Video Generation with Updatable Spatial Memory
ObjectMorpher: 3D-Aware Image Editing via Deformable 3DGS
VISTA: A Test-Time Self-Improving Video Generation Agent
VRCLIP: Multimodal Canonical Correlation Alignment for CLIP-Driven Vision-Radio Person Re-Identification
OpenVoxel: Training-Free Grouping and Captioning Voxels for Open-Vocabulary 3D Scene Understanding
SASNet: Spatially-Adaptive Sinusoidal Networks for INRs
Spk2VidNet: A Hierarchical Recurrent Architecture for High-Fidelity Video Reconstruction from Long Spike-Camera Streams
Mark4D: Temporally-Consistent Watermarking for 4D Gaussian Splatting
Hyperbolic Busemann Neural Networks
GeCo-SRT: Geometry-aware Continual Adaptation for Cross-Task Sim-to-Real Transfer
WeDetect: Fast Open-Vocabulary Object Detection as Retrieval
Monet: Reasoning in Latent Visual Space Beyond Image and Language
Eulerian Gaussian Splatting using Hashed Probability Pyramids
Disentangled Textual Priors for Diffusion-based Image Super-Resolution
PhysGen: Physically Grounded 3D Shape Generation for Industrial Design
AwareVLN: Reasoning with Self-awareness for Vision-Language Navigation
Gastric-X: A Multimodal Multi-Phase Benchmark Dataset for Advancing Vision-Language Models in Gastric Cancer Analysis
Space-Time Forecasting of Dynamic Scenes with Motion-aware Gaussian Grouping
Discriminative Perception via Anchored Description for Reasoning Segmentation
GenBreak: Red Teaming Text-to-Image Generation Using Large Language Models
Grounding Everything in Tokens for Multimodal Large Language Models
Streaming Video Instruction Tuning
StableMTL: Repurposing Latent Diffusion Models for Multi-Task Learning from Partially Annotated Synthetic Datasets
Flash-DMD: Towards High-Fidelity Few-Step Image Generation with Efficient Distillation and Joint Reinforcement Learning
P-Flow: Prompting Visual Effects Generation
Hear What Matters! Text-conditioned Selective Video-to-Audio Generation
Context-Nav: Context-Driven Exploration and Viewpoint-Aware 3D Spatial Reasoning for Instance Navigation
Residual Decoder Adapter: ID-Preserving Tokenizer Adaption for Autoregressive Text Rendering
Probabilistic Concept Graph Reasoning for Multimodal Misinformation Detection
MatMart: Material Reconstruction of 3D Objects via Diffusion
Towards Generalizable AI-Generated Image Detection via Image-Adaptive Prompt Learning
SwiftVLA: Unlocking Spatiotemporal Dynamics for Lightweight VLA Models at Minimal Overhead
Expert-Teacher-Student Collaborative Learning for Domain Adaptive Object Detection
Solving Minimal Problems Without Matrix Inversion Using FFT-Based Interpolation
ArtHOI: Taming Foundation Models for Monocular 4D Reconstruction of Hand-Articulated-Object Interactions
What Makes Good Synthetic Training Data for Zero-Shot Stereo Matching?
Rectifying Latent Space for Generative Single-Image Reflection Removal
VULCAN: Tool-Augmented Multi Agents for Iterative 3D Object Arrangement
FlashLips: 100-FPS Mask-Free Latent Lip-Sync using Reconstruction Instead of Diffusion or GANs
OctoT2I: A Self-Evolving Agentic Text-to-Image Router
Learning to Select Visual Tools from Experience
FaceCam: Portrait Video Camera Control via Scale-Aware Conditioning
ClusterMark: Towards Robust Watermarking for Autoregressive Image Generators with Visual Token Clustering
LuxRemix: Lighting Decomposition and Remixing for Indoor Scenes
PAI-Bench: A Comprehensive Benchmark For Physical AI
GM-R$^2$: Generative Matching Learning for Unsupervised Geometric Representation and Registration
Rethinking UMM Visual Generation: Masked Modeling for Efficient Image-Only Pre-training
Emergent Outlier View Rejection in Visual Geometry Grounded Transformers
Plenoptic Video Generation
VAR RL Done Right: Tackling Asynchronous Policy Conflicts in Visual Autoregressive Generation
Zero-Shot Depth Completion with Vision-Language Model
Hidden Dangers of Compositional Generation: Diagnosing Semantic Safety Failures in Text-to-Image Models
Parallel Rigidity Matters for Bundle Adjustment
CubeComposer: Spatio-Temporal Autoregressive 4K 360° Video Generation from Perspective Video
From Few-way to Many-way: Rethinking Few-shot Fine-grained Image Classification
Language-Grounded Decoupled Action Representation for Robotic Manipulation
MimicTalker: A Multimodal Interactive and Memory-Enhanced Framework for Real-Time Dyadic 3D Head Generation
Bi-directional Autoregressive Diffusion for Large Complex Motion Interpolation
BIT: Matching-based Bi-directional Interaction Transformation Network for Visible-Infrared Person Re-Identification
Batch Loss Score for Dynamic Data Pruning
Weaver: Decoupled Training for Interleaved Multi-modal Generation
4D Local Modeling Toward Dynamic Global Perception for Ambiguity-free Rotation-Invariant Point Cloud Analysis
L3DR: 3D-aware LiDAR Diffusion and Rectification
Splat-Based Metal Artifact Reduction in Cone-Beam CT via Compact Attenuation Modeling
Uni3R: Unified 3D Reconstruction and Semantic Understanding via Generalizable Gaussian Splatting from Unposed Multi-View Images
Principled Steering via Null-space Projection for Jailbreak Defense in Vision-Language Models
DeepProtect: Proactive Face-Swapping Defense using Identity Blending and Attribute Distortion
Random Wins All: Rethinking Grouping Strategies for Vision Tokens
Write Where It Matters: Policy-Guided Watermarks for 3D Gaussian Splatting
MindDriver: Introducing Progressive Multimodal Reasoning for Autonomous Driving
Boosting Self-Supervised Tracking with Contextual Prompts and Noise Learning
VMD-FACT: A New Video Dataset and MLLM-based method for Detecting Realistic AI-Generated Video Misinformation
Speeding Up the Learning of 3D Gaussians with Much Shorter Gaussian Lists
MeshWeaver: Sparse-Voxel-Guided Surface Weaving for Autoregressive Mesh Generation
EXOTIC: External Vision-driven Incomplete Multi-view Classification
Robust Remote Sensing Image–Text Retrieval with Noisy Correspondence
TLMA: Mitigating the Impact of Weakly Labeled Information for Video Anomaly Detection
RDF-MIG: A Robust Diffusion Framework for Masked Image Generation to Augment Semantic Segmentation and Change Detection
LLMind: Bio-inspired Training-free Adaptive Visual Representations for Vision-Language Models
MLLMSplat: A 2D MLLM-Powered Framework for 3D Gaussian Splatting Understanding, Generation, and Editing
Unified Customized Generation by Disentangled Reward Modeling
Do Vision-Language Models Leak What They Learn? Adaptive Token-Weighted Model Inversion Attacks
Revisiting Geometric Obfuscation with Dual Convergent Lines for Privacy-Preserving Image Queries in Visual Localization
LoL: Longer than Longer, Scaling Video Generation to Hour
Ultrasound-CLIP: Semantic-Aware Contrastive Pre-training for Ultrasound Image-Text Understanding
RF4D: Neural Radar Fields for Novel View Synthesis in Outdoor Dynamic Scenes
Modeling the Brain’s Grammar: ROI-Guided fMRI Pretraining for Transferable and Interpretable Vision Decoding
ReHyAt: Recurrent Hybrid Attention for Video Diffusion Transformers
Particulate: Feed-Forward 3D Object Articulation
HypeVPR: Exploring Hyperbolic Space for Perspective to Equirectangular Visual Place Recognition
Sparsity as a Key: Unlocking New Insights from Latent Structures for Out-of-Distribution Detection
Designing to Forget: Deep Semi-parametric Models for Unlearning
Learning from Itself: Mining Internal Knowledge from Vision Language Models for Continual Learning
OSPO: Object-Centric Self-Improving Preference Optimization for Text-to-Image Generation
iLRM: An Iterative Large 3D Reconstruction Model
B$^3$-Seg: Camera-Free, Training-Free 3DGS Segmentation via Analytic EIG and Beta–Bernoulli Bayesian Updates
Taming Noise-Induced Prototype Degradation for Privacy-Preserving Personalized Federated Fine-Tuning
Role-SynthCLIP: A Role-Play Driven Diverse Synthetic Data Approach
MVP: Multiple View Prediction improves GUI grounding
DrivePI: Spatial-aware 4D MLLM for Unified Autonomous Driving Understanding, Perception, Prediction and Planning
Is Parameter Isolation Better for Prompt-Based Continual Learning?
E-RayZer: Self-supervised 3D Reconstruction as Spatial Visual Pre-training
MOS: Mitigating Optical-SAR Modality Gap for Cross-Modal Ship Re-Identification
UniLight: A Unified Representation for Lighting
Learning to Assist: Physics-Grounded Human-Human Control via Multi-Agent Reinforcement Learning
UniPixie: Unified and Probabilistic 3D Physics Learning via Flow Matching
NexusFlow: Unifying Disparate Tasks under Partial Supervision via Invertible Flow Networks
Unified Primitive Proxies for Structured Shape Completion
AudioStory: Generating Long-Form Narrative Audio with Large Language Models
ViRC: Enhancing Visual Interleaved Mathematical CoT with Reason Chunking
PDCR: Perception-Decomposed Confidence Reward for Vision-Language Reasoning
AnyID: Ultra-Fidelity Universal Identity-Preserving Video Generation from Any Visual References
Detect Anything via Next Point Prediction
SuP: Sub-cloud Driven Point Cloud Registration
AdaRadar: Rate Adaptive Spectral Compression for Radar-based Perception
Clay-to-Stone: Phase-wise 3D Gaussian Splatting for Monocular Articulated Hand-Object Manipulation Modeling
DynamicsBoost: Dynamic Plausible Video Generation via Annotation-Free Continuation Preference Optimization
Noise-aware few-shot learning through bi-directional multi-view prompt alignment
Differentially Private 2D Human Pose Estimation
M$^2$SeR: Multimodal Self-Refinement for Lightweight Image Captioning
Semantic Foam: Unifying Spatial and Semantic Scene Decomposition
PV-Ground: Text-Guided Point-Voxel Interaction for 3D Visual Grounding
SignPR: A Progressive Vector-Quantized Diffusion Framework for Sign Language Production
Motion 3-to-4: 3D Motion Reconstruction for 4D Synthesis
Cross-Domain Demo-to-Code via Neurosymbolic Counterfactual Reasoning
Enhance-then-Balance Modality Collaboration for Robust Multimodal Sentiment Analysis
NOWA: Null-space Optical Watermark for Invisible Capture Fingerprinting and Tamper Localization
GenSplat: Bridging the Generalization Gap in 3DGS Language Comprehension
Action–Geometry Prediction with 3D Geometric Prior for Bimanual Manipulation
Sparse Spectral LoRA: Routed Experts for Medical VLMs
Mind the Gap: Transferring Labels to Align Object Detection Datasets
Toward Low-Cost yet Effective Temporal Learning for UAV Tracking
MFEN: Multi-Frequency Expert Network for Visible-Infrared Person Re-ID
DiffusionFF: A Diffusion-based Framework for Joint Face Forgery Detection and Fine-Grained Artifact Localization
Reliev3R: Relieving Feed-forward 3D Reconstruction from Multi-View Geometric Annotations
Ultra Diffusion Poser: Diffusion-Based Human Motion Tracking from Sparse Inertial Sensors and Ranging-based Between-sensor Distances
GDRO: Group-level Reward Post-training Suitable for Diffusion Models
KV-Tracker: Real-Time Pose Tracking with Transformers
2-Shots in the Dark: Low-Light Denoising with Minimal Data Acquisition
Plug-and-Play PDE Optimization for 3D Gaussian Splatting: Toward High-Quality Rendering and Reconstruction
Towards High-Quality Image Segmentation: Improving Topology Accuracy by Penalizing Neighbor Pixels
RawMetaDiff: Unlocking Extreme Darkness from Dual-Exposure RAW with Meta-Guided Diffusion
SAGA: Source Attribution of Generative AI Videos
ImageRAGTurbo: Towards One-step Text-to-Image Generation with Retrieval-Augmented Diffusion Models
DriverGaze360: OmniDirectional Driver Attention with Object-Level Guidance
Learning Personalized Photographic Style from Pairwise User Preferences
PoInit-of-View: Poisoning Initialization of Views Transfers Across Multiple 3D Reconstruction Systems
Beyond [CLS] Token: Query-Driven Token-Level Forgery Purification for Generalizable Deepfake Detection
VecAttention: Vector-wise Sparse Attention for Accelerating Long Context Inference
Fast-FoundationStereo: Real-Time Zero-Shot Stereo Matching
FlashVSR: Towards Real-time Diffusion-Based Streaming Video Super Resolution
Adaptive Confidence Regularization for Multimodal Failure Detection
Geometric-Aware Hypergraph Reasoning for Novel Class Discovery in Point Cloud Segmentation
PhyCo: Learning Controllable Physical Priors for Generative Motion
VesMamba: 3D Pulmonary Vessel Segmentation from CT images via Mamba with Structural Perception and Scale-aware Filtering
Test-time Ego-Exo-centric Adaptation for Action Anticipation via Multi-Label Prototype Growing and Dual-Clue Consistency
ORD: Object-Relation Decoupling for Generalized 3D Visual Grounding
You Only Erase Once: Erasing Anything without Bringing Unexpected Content
Vibe Spaces for Creatively Connecting and Expressing Visual Concepts
A Multi-Agent Perception-Action Alliance for Efficient Long Video Reasoning
Interact2Ar: Full-Body Human-Human Interaction Generation via Autoregressive Diffusion Models
MooCap: A Multi-View Benchmark for Cow-Object-Human Interaction and Behavior Dynamics
Multi-Hierarchical Contrastive Spectral Fusion for Multi-View Clustering
YOSE: You Only Select Essential Tokens for Efficient DiT-based Video Object Removal
PanDA: Panoptic Domain Adaptation for Multimodal Perception in Autonomous Driving
FARMER: Flow AutoRegressive Transformer over Pixels
VEMamba: Efficient Isotropic Reconstruction of Volume Electron Microscopy with Axial-Lateral Consistent Mamba
APPO: Attention-guided Perception Policy Optimization for Video Reasoning
Self-Evaluation Unlocks Any-Step Text-to-Image Generation
VMonarch: Efficient Video Diffusion Transformers with Structured Attention
Joint-Aligned Latent Action: Towards Scalable VLA Pretraining in the Wild
Cross-Modal Guided Visual Synthesis for Data-Efficient Multimodal Depression Recognition
Scaling the Long Video Understanding of Multimodal Large Language Models via Visual Memory Mechanism
Towards Robust Sequential Decomposition for Complex Image Editing
SketchAssist: A Practical Assistant for Semantic Edits and Precise Local Redrawing
SJD++: Accelerating Speculative Jacobi Decoding for Text-to-Image Models via Multi-Drafting and Enhanced Rejection Stability
SURF: Signature-retained Fast Video Generation
Enhancing Vision Language Models for 4D Perception
Endless World: Real-Time 3D-Aware Long Video Generation
DDT: Decoupled Diffusion Transformer
Incentivizing Generative Zero-Shot Learning via Outcome-Reward Reinforcement Learning with Visual Cues
Why Not Hyperparameter-Friendly Optimisation? A Monotonic Adaptive Norm Rescaling Approach For Long-Tailed Recognition
TIM: Temporal Decoupling with Iterative Mutual-Refinement Model for Longitudinal Radiology Report Generation
ReFlow: Self-correction Motion Learning for Dynamic Scene Reconstruction
SDGS: Spatial Difference Guided Gaussian Splatting for Simultaneous Localization and 3D Reconstruction
Guiding Diffusion Models with Fine-Grained Conditions and Semantics-Preserving Sampling for One-Shot Federated Learning
WeatherCity: Urban Scene Reconstruction with Controllable Multi-Weather Transformation
LensWalk: Agentic Video Understanding by Planning How You See in Videos
3D sans 3D Scans: Scalable Pre-training from Video-Generated Point Clouds
CIGPose: Causal Intervention Graph Neural Network for Whole-Body Pose Estimation
Efficiently Reconstructing Dynamic Scenes one D4RT at a Time
RISE: Single Static Radar-based Indoor Scene Understanding
Beyond Prompt Degradation: Prototype-guided Dual-pool Prompting for Incremental Object Detection
RefAV: Towards Planning Centric Scenario Mining
LoST: Level of Semantics Tokenization for 3D Shapes
MSGNav: Unleashing the Power of Multi-modal 3D Scene Graph for Zero-Shot Embodied Navigation
TokenGS: Decoupling 3D Gaussian Prediction from Pixels with Learnable Tokens
TIACam: Text-Anchored Invariant Feature Learning with Auto-Augmentation for Camera-Robust Zero-Watermarking
Anti-I2V: Safeguarding your photos from malicious image-to-video generation
RAW-Domain Degradation Models for Realistic Smartphone Super-Resolution
Turbo-GS: Accelerating 3D Gaussian Fitting for High-Resolution Radiance Fields
Cluster-Aware Neural Collapse Prompt Tuning for Long-Tailed Generalization of Vision-Language Models
Beyond Semantic Search: Towards Referential Anchoring in Composed Image Retrieval
Dejavu: Towards Experience Feedback Learning for Embodied Intelligence
DC-Merge: Improving Model Merging with Directional Consistency
Scaling Test-Time Robustness of Vision-Language Models via Self-Critical Inference Framework
Lighting-grounded Video Generation with Renderer-based Agent Reasoning
Hybrid Agents for Image Restoration
Gaussian Mapping for Evolving Scenes
ReManNet: A Riemannian Manifold Network for Monocular 3D Lane Detection
VIRO: Robust and Efficient Neuro-Symbolic Reasoning with Verification for Referring Expression Comprehension
REArtGS++: Generalizable Articulation Reconstruction with Temporal Geometry Constraint via Planar Gaussian Splatting
Finding Distributed Object-Centric Properties in Self-Supervised Transformers
GaussFusion: Improving 3D Reconstruction in the Wild with Geometry-Informed Video Generator
Learning to Drive via Real-World Simulation at Scale
CrackSSM: Reviving SSMs for Crack Segmentation via Dynamic Scanning
TGT: Text-Grounded Trajectories for Locally Controlled Video Generation
FilterGS: Traversal-Free Parallel Filtering and Adaptive Shrinkage for Large-scale LoD 3D Gaussian Splatting
VLM-PTQ: Efficient Post-Training Quantization for Large Vision-Language Models
fMRI-LM: Towards a Universal Foundation Model for Language-Aligned fMRI Understanding
VITAL: Vision-Encoder-centered Pretraining for LMMs in Visual Quality Assessment
POUR: A Provably Optimal Method for Unlearning Representation via Neural Collapse
Reliable Policy Transfer for Safety-Aware End-to-End Driving with Deep Reinforcement Learning
What Is the Optimal Ranking Score Between Precision and Recall? We Can Always Find It and It Is Rarely $F_1$
Scaling Multi-Identity Consistency for Image Customization via Multi-to-Multi Matching Paradigm
BinaryAttention: One-Bit Attention for Vision and Diffusion Transformers
Points-to-3D: Structure-Aware 3D Generation with Point Cloud Priors
ReasonX: MLLM-Guided Intrinsic Image Decomposition
Scaling View Synthesis Transformers
UI-Lens: Assessing General MLLMs’ Potential to Automate UI Display Quality Assurance
C$^3$R: Cross-Modal Cycle Consistency Rewards Improve Multimodal Reasoning
OneHOI: Unifying Human-Object Interaction Generation and Editing
SCAPO: Self-Supervised Category-Level Articulated Pose Estimation from a Single 3D Observation
Fine-grained Image Aesthetic Assessment: Learning Discriminative Scores from Relative Ranks
DEVA: Fine-tuning Multimodal Large Language Models for Visual Perception Tasks
EfficientMonoHair: Fast Strand-Level Reconstruction from Monocular Video via Multi-View Direction Fusion
Balanced Hierarchical Contrastive Learning with Decoupled Queries for Fine-grained Object Detection in Remote Sensing Images
Natural Human Motion Recovery by Aligning High-Order Temporal Dynamics from Monocular Videos
Geometry-driven OOD Detectors Are Class-Incremental Learners
LogCD: Local-to-global Consistency Distillation for Few-step Image Generation
EventHub: Data Factory for Generalizable Event-Based Stereo Networks without Active Sensors
CCF: Complementary Collaborative Fusion for Domain Generalized Multi-Modal 3D Object Detection
DualMirage: Hunting Stealthy Multimodal LLM Agents via CAPTCHAs with Contour and Adversarial Illusions
LIFT and PLACE: A Simple, Stable, and Effective Knowledge Distillation Framework for Lightweight Diffusion Models
Time Blindness: Why Video-Language Models Can’t See What Humans Can?
FaceDirector: A Hierarchical Disentanglement Framework for Controllable and Real-time Facial Reenactment
U-Mind: A Unified Framework for Real-Time Multimodal Interaction with Audiovisual Generation
SeD-UD: An Influence-Driven and Hierarchically-Decoupled Information Bottleneck for Multimodal Intent Recognition
Render-to-Adapt: Unsupervised Personal Adaptation for Gaze Estimation
Video Panels for Long Video Understanding
Phrase-Grounding-Aware Supervised Fine-Tuning for Chart Recognition via Side-Masked Attention
SeaCache: Spectral-Evolution-Aware Cache for Accelerating Diffusion Models
Unlocking Pre-trained Weights: Parameter Inheritance for Zero-Shot Initialization
Bridging Brain and Semantics: A Hierarchical Framework for Semantically Enhanced fMRI-to-Video Reconstruction
Hier-COS: Making Deep Features Hierarchy-aware via Composition of Orthogonal Subspaces
SpiralDiff: Spiral Diffusion with LoRA for RGB-to-RAW Conversion Across Cameras
EmoTaG: Emotion-Aware Talking Head Synthesis on Gaussian Splatting with Few-Shot Personalization
FreqSIC: Frequency-aware Stereo Image Compression with Bi-directional Checkerboard Context Model
Duala: Dual-level Alignment of Subjects and Stimuli for Cross-Subject fMRI Decoding
Training High-Level Schedulers with Execution-Feedback Reinforcement Learning for Long-Horizon GUI Automation
Beyond 3D VQAs: Injecting 3D Spatial Priors into Vision-Language Models for Enhanced Geometric Reasoning
FedRAC: Rolling Submodel Allocation for Collaborative Fairness in Federated Learning
RAAS: LLM Agentic System Architecture Search with GRPO
Reasoning-Driven Anomaly Detection and Localization with Image-Level Supervision
Merge3D: Efficient 3D Multimodal LLMs via Joint 2D-3D Token Merging
Beyond the Ground Truth: Enhanced Supervision for Image Restoration
GenErase: Generalizable and Semantically-Aware Concept Erasure in Diffusion Models
A Closer Look at Cross-Domain Few-Shot Object Detection: Fine-Tuning Matters and Parallel Decoder Helps
PatchAlign3D: Local Feature Alignment for Dense 3D Shape Understanding
SoccerMaster: A Vision Foundation Model for Soccer Understanding
One Token, Two Fates: A Unified Framework via Vision Token Manipulation Against MLLMs Hallucination
FRM: Linear-Time 3D Reconstruction via Test-Time Training
Multimodal RewardBench 2: Evaluating Omni Reward Models for Interleaved Text and Image
Generative Neural Video Compression via Video Diffusion Prior
PPM-CLIP: Probabilistic Prompt Modeling for Generalizable AI-Generated Image Detection
LLaDA-MedV: Exploring Large Language Diffusion Models for Biomedical Image Understanding
RPGFusion: 4D Radar Prior-Guided Multi-Modal Fusion for 3D Detection
Curvature-Aware Zeroth-Order Optimization for Memory-Efficient Test-Time Adaptation
Material Magic Wand: Material-Aware Grouping of 3D Parts in Untextured Meshes
ConceptPrism: Concept Disentanglement in Personalized Diffusion Models via Residual Token Optimization
OpenDPR: Open-Vocabulary Change Detection via Vision-Centric Diffusion-Guided Prototype Retrieval for Remote Sensing Imagery
Learning and Aligning Click-Aware Shape Prior for Interactive Amodal Instance Segmentation
RetouchIQ: MLLM Agents for Instruction-Based Image Retouching with Generalist Reward
3DrawAgent: Teaching LLM to Draw in 3D with early relative experience
Underground Plant Exploration: Non-Destructive 3D Root Assessment with GPR Based on Point Graph Neural Network
GauMVC: Generative Decoupled Gaussian Representation for Human-centric Multi-view Video Compression
Lafite: A Generative Latent Field for 3D Native Texturing
From Global Alignment to Local Semantics: Understanding Visual Representations Structures in Multimodal LLMs
Measuring the (Un)Faithfulness of Concept-Based Explanations
Scaling Parallel Sequence Models to Vision Foundation Models
Intrinsic Geometry-Appearance Consistency Optimization for Sparse-View Gaussian Splatting
Point Cloud as a Foreign Language for Multi-modal Large Language Model
Cross from Left to Right Brain: Adaptive Text Dreamer for Vision-and-Language Navigation
MuKV: Multi-Grained KV Cache Compression for Long Streaming Video Question Answering
HSI-GPT2: A Dual-Granularity Large Motion Reasoning Model with Diffusion Refinement for Human–Scene Interaction
HQC-NBV: A Hybrid Quantum-Classical View Planning Approach
Aligning What Vision-Language Models See and Perceive with Adaptive Information Flow
Action Motifs: Self-Supervised Hierarchical Representation of Human Body Movements
Enhancing Accuracy of Uncertainty Estimation in Appearance-based Gaze Tracking with Probabilistic Evaluation and Calibration
From Intuition to Investigation: A Tool-Augmented Reasoning MLLM Framework for Generalizable Face Anti-Spoofing
Training-free Mixed-Resolution Latent Upsampling for Spatially Accelerated Diffusion Transformers
FVBench: Benchmarking Deepfake Video Detection Capability of Large Multimodal Models
MM-ReCoder: Advancing Chart-to-Code Generation with Reinforcement Learning and Self-Correction
WonderZoom: Multi-Scale 3D World Generation
V-RGBX: Video Editing with Accurate Controls over Intrinsic Properties
Identity-Preserving Image-to-Video Generation via Reward-Guided Optimization
Sensor2Sensor: Cross-Embodiment Sensor Conversion for Autonomous Driving
Test-Time Training for LiDAR Semantic Segmentation under Corruption via Geometric Inlier Discrimination
RigMo: Unifying Rig and Motion Learning for Generative Animation
LLaDA-V: Large Language Diffusion Models with Visual Instruction Tuning
XR-Poser: Accurate Egocentric Human Motion Estimation for AR/VR
Sparse-LaViDa: Sparse Multimodal Discrete Diffusion Models
FBTA: Enabling Single-GPU End-to-End Gigapixel WSI Classification with Feature Bridging and Translation Alignment
SAGE: Training Smart Any-Horizon Agents for Long Video Reasoning with Reinforcement Learning
MuCo: Multi-turn Contrastive Learning for Multimodal Embedding Model
PE3R: Perception-Efficient 3D Reconstruction
Improving Calibration in Test-Time Prompt Tuning for Vision-Language Models via Data-Free Flatness-Aware Prompt Pretraining
MaskDexGrasp: Generative Masked Modeling for Part-Aware Dexterous Grasp Synthesis
Learning to Drive is a Free Gift: Large-Scale Label-Free Autonomy Pretraining from Unposed In-The-Wild Videos
Text-Phase Synergy Network with Dual Priors for Unsupervised Cross-Domain Image Retrieval
GaussianPile: A Unified Sparse Gaussian Splatting Framework for Slice-based Volumetric Reconstruction
Rethinking Pose Refinement in 3D Gaussian Splatting under Pose Prior and Geometric Uncertainty
Guiding a Diffusion Transformer with the Internal Dynamics of Itself
ReasonMap: Towards Fine-Grained Visual Reasoning from Transit Maps
Boosting Vision-Language Models Towards Cross-Domain Incremental Object Detection
Curriculum Group Policy Optimization: Adaptive Sampling for Unleashing the Potential of Text-to-Image Generation
LAMP: Localization Aware Multi-camera People Tracking in Metric 3D World
Affine Perspective-Three-Point Problem
StreamingTOM: Streaming Token Compression for Efficient Video Understanding
From Rays to Projections: Better Inputs for Feed-Forward View Synthesis
Distilling Balanced Knowledge from a Biased Teacher
Flow Map Distillation Without Data
GeoSAM2: Unleashing the Power of SAM2 for 3D Part Segmentation
Dynamics-Aware Preference Optimization for Vision-Language Models
TextOVSR: Text-Guided Real-World Opera Video Super-Resolution
Affostruction: 3D Affordance Grounding with Generative Reconstruction
YOLO-Master: MOE-Accelerated with Specialized Transformers for Enhanced Real-time Detection
PhysHO: Physics-Based Dynamic 3D Gaussian Human and Object from Monocular Video
WaDi: Weight Direction-aware Distillation for One-step Image Synthesis
CoT-Edit: Let CoT Guide Instruction Video Editing
STCDiT: Spatio-Temporally Consistent Diffusion Transformer for High-Quality Video Super-Resolution
Obstruction reasoning for robotic grasping
Transition Models: Rethinking the Generative Learning Objective
3D Space as a Scratchpad for Editable Text-to-Image Generation
RegionFuse: Region-Adaptive Pixel Distribution Learning for Infrared and Visible Image Fusion
Event Structural Valley: A Unified Theoretical and Practical Framework for Event Camera Autofocus
MotionScale: Reconstructing Appearance, Geometry, and Motion of Dynamic Scenes with Scalable 4D Gaussian Splatting
SynMotion: Semantic-Visual Adaptation for Motion Customized Video Generation
DeCo: Frequency-Decoupled Pixel Diffusion for End-to-End Image Generation
Semantic Alignment for Pose-Invariant Identity Preserving Diffusion
Event-based Motion Deblurring with Unpaired Data
ExtrinSplat: Decoupling Geometry and Semantics for Open-Vocabulary Understanding in 3D Gaussian Splatting
See Further, Think Deeper: Advancing VLM's Reasoning Ability with Low-level Visual Cues and Reflection
4D Primitive-Mâché: Glueing Primitives for Persistent 4D Scene Reconstruction
ProjFlow: Projection Sampling with Flow Matching for Zero-Shot Exact Spatial Motion Control
CFG-Ctrl: Control-Based Classifier-Free Diffusion Guidance
Kontinuous Kontext: Continuous Strength Control for Instruction-based Image Editing
High-Fidelity Diffusion Face Swapping with ID-Constrained Facial Conditioning
DiffusionHarmonizer: Bridging Neural Reconstruction and Photorealistic Simulation with Online Diffusion Enhancer
VOSR: A Vision-Only Generative Model for Image Super-Resolution
MAD: Motion Appearance Decoupling for efficient Driving World Models
Omni-AD: A Large-scale and Versatile Benchmark for Industrial Anomaly Detection
ResiHMR: Residual-Limb Aware Single-Image 3D Human Mesh Recovery for Individuals with Limb Loss
Efficient Encoder-Free Fourier-based 3D Large Multimodal Model
From Spots to Pixels: Dense Spatial Gene Expression Prediction from Histology Images
XPaintNet: An eXtreme Lightweight Framework for Stereoscopic Conversion without Inpainting Network
Hierarchical Concept Embedding & Pursuit for Interpretable Image Classification
ProOOD: Prototype-Guided Out-of-Distribution 3D Occupancy Prediction
VideoFusion: A Spatio-Temporal Collaborative Network for Multi-modal Video Fusion
Relational Visual Similarity
MoE-GRPO: Optimizing Mixture-of-Experts via Reinforcement Learning in Vision-Language Models
Text-guided Feature Disentanglement for Cross-modal Gait Recognition
D-Convexity: A Unified Differentiable Convex Shape Prior via Quasi-Concavity for Data-driven Image Segmentation
CaST-Bench: Benchmarking Causal Chain-Grounded Spatio-Temporal Reasoning for Video Question Answering
Hg-I2P: Bridging Modalities for Generalizable Image-to-Point-Cloud Registration via Heterogeneous Graphs
Inter-Edit: First Benchmark for Interactive Instruction-Based Image Editing
No Labels, No Look-Ahead: Unsupervised Online Video Stabilization with Classical Priors
Verifying Neural Network Robustness with Dual Perturbations
FastGS: Training 3D Gaussian Splatting in 100 Seconds
Ar2Can: An Architect and an Artist Leveraging a Canvas for Multi-Human Generation
EMGauss: Continuous Slice-to-3D Reconstruction via Dynamic Gaussian Modeling in Volume Electron Microscopy
Beyond Rule-Based Agents: Active Markov Games for Realistic Multi-Agent Interaction in Autonomous Driving
TrackMAE: Video Representation Learning via Track Mask and Predict
Vision Transformers Need More Than Registers
InterRVOS: Interaction-Aware Referring Video Object Segmentation
Shoe Style-Invariant and Ground-Aware Learning for Dense Foot Contact Estimation
TokenTrace: Multi-Concept Attribution through Watermarked Token Recovery
Diffusion Sampling Path Tells More: An Efficient Plug-and-Play Strategy for Sample Filtering
Rotation Invariant and Symmetry Aware Pixel Difference Network for Remote Sensing Object Detection
ShotDirector: Directorially Controllable Multi-Shot Video Generation with Cinematographic Transitions
SeeGroup: Multi-Layer Depth Estimation of Transparent Surfaces via Self-Determined Grouping
SenCache: Accelerating Diffusion Model Inference via Sensitivity-Aware Caching
Thermal-Det: Language-Guided Cross-Modal Distillation for Open-Vocabulary Thermal Object Detection
Learning Differentiable Hierarchies in 3D Gaussian Splatting
UniLS: End-to-End Audio-Driven Avatars for Unified Listening and Speaking
GeoTikzBridge: Advancing Multimodal Code Generation for Geometric Perception and Reasoning
PhysSkin: Real-Time and Generalizable Physics-Based Animation via Self-Supervised Neural Skinning
WildCap: Facial Appearance Capture in the Wild via Hybrid Inverse Rendering
Polarization State Tracing for Reflection Removal and Color-Consistent Reconstruction
FACE: A Face-based Autoregressive Representation for High-Fidelity and Efficient Mesh Generation
$\text{F}^2\text{HDR}$: Two-Stage HDR Video Reconstruction via Flow Adapter and Physical Motion Modeling
Personalized Audio-driven Whole-body Talking Avatars
InstAP: Instance-Aware Vision-Language Pre-Train for Spatial-Temporal Understanding
LRDUN: A Low-Rank Deep Unfolding Network for Efficient Spectral Compressive Imaging
FlashDecoder: Real-Time Latent-to-Pixel Streaming Decoder with Transformers
LEADER: Learning Reliable Local-to-Global Correspondences for LiDAR Relocalization
VidTAG: Temporally Aligned Video to GPS Geolocalization with Denoising Sequence Prediction at a Global Scale
TAS-LoRA: Transformer Architecture Search with Mixture-of-LoRA Experts
KASALv2: Fully Automatic 3D Rotational Symmetry Classification and Axis Localization
Rethinking Intermediate Representation for VLM-based Robot Manipulation
HierUQ: Hierarchical Uncertainty Quantification with Adaptive Granularity Reconciliation for Degraded Image Classification
Unleashing the Power of Chain-of-Prediction for Monocular 3D Object Detection
Spatially Consistent 3D Universal Adversarial Objects for BEV Detectors
Physical Simulator In-the-Loop Video Generation
PixDLM: A Dual-Path Multimodal Language Model for UAV Reasoning Segmentation
Test-Time Multi-Prompt Adaptation for Open-Vocabulary Remote Sensing Image Segmentation
Frequency-domain Manipulation for Face Obfuscation
MetaSpectra+: A Compact Broadband Metasurface Camera for Snapshot Hyperspectral+ Imaging
Uncertainty-Aware Exploratory Direct Preference Optimization for Multimodal Large Language Models
Chaining Basic Capabilities for Embodied Task Planning
PowerCLIP: Powerset Alignment for Fine-Grained Contrastive Pre-Training
Generative Video Motion Editing with 3D Point Tracks
Geometry-as-context: Modulating Explicit 3D in Scene-consistent Video Generation to Geometry Context
VideoSeek: Long-Horizon Video Agent with Tool-Guided Seeking
R4-CGQA: Retrieval-based Vision Language Models for Computer Graphics Image Quality Assessment
FoleyDesigner: Immersive Stereo Foley Generation with Precise Spatio-Temporal Alignment for Film Clips
Revisiting Visual Corruptions in LVLMs: A Shape–Texture Perspective on Model Failures
ACoT-VLA: Action Chain-of-Thought for Vision-Language-Action Models
CoSMo3D: Open-World Promptable 3D Semantic Segmentation through LLM-Guided Canonical Spatial Modeling
DLVP-CLIP: Enhancing Fine-Grained Zero-Shot Anomaly Detection via Dynamic Local Visual Prompting
ReMoT: Reinforcement Learning with Motion Contrast Triplets
Region-Adaptive Sampling for Diffusion Transformers
Anchor-Guided Gradient Alignment for Incomplete Multimodal Learning
SGSoft: Learning Fused Semantic-Geometric Features for 3D Shape Correspondence via Template-Guided Soft Signals
Locate-Then-Examine: Grounded Region Reasoning Improves Detection of AI-Generated Images
CDICS: Delving Into Fine-Grained Attribute for In-Context Segmentation via Compositional Prompts and Phased Decoupling
BUSSARD: Normalizing Flows for Bijective Universal Scene-Specific Anomalous Relationship Detection
Efficient Video Object Segmentation and Tracking with Recurrent Dynamic Submodel
Towards Open Environments: General Vision-Language Navigation via Fast-Slow Interactive Reasoning
VCU-Bridge: Hierarchical Visual Connotation Understanding via Semantic Bridging
From Detection to Association: Learning Discriminative Object Embeddings for Multi-Object Tracking
3D Gaussian Splatting at Arbitrary Resolution with Compact Proxy Anchors
SLVMEval: Synthetic Meta Evaluation Benchmark for Text-to-Long Video Generation
Visual Autoregressive Modeling via Next Focus Prediction
Model Merging in the Essential Subspace
A Stitch in Time: Learning Procedural Workflow via Self-Supervised Plackett–Luce Ranking
Multi-view Pyramid Transformer: Look Coarser to See Broader
The Consistency Critic: Correcting Inconsistencies in Generated Images via Reference-Guided Attentive Alignment
GeoDiff4D: Geometry-Aware Diffusion for 4D Head Avatar Reconstruction
HERBench: A Benchmark for Multi-Evidence Integration in Video Question Answering
Physical Adversarial Clothing Evades Visible-Thermal Detectors via Non-Overlapping RGB-T Pattern
rPPG-VQA: A Video Quality Assessment Framework for Unsupervised rPPG Training
Upsample Anything: A Simple and Hard to Beat Baseline for Feature Upsampling
VLA Models Are More Generalizable Than You Think: Revisiting Physical and Spatial Modeling
AMB3R: Accurate Feed-forward Metric-scale 3D Reconstruction with Backend
VisRef: Visual Refocusing while Thinking Improves Test-Time Scaling in Multi-Modal Large Reasoning Models
SafeRoPE: Risk-specific Head-wise Embedding Rotation for Safe Generation in Rectified Flow Transformers
Featurising Pixels from Dynamic 3D Scenes with Linear In-Context Learners
TouchDream: 3D Object Completion through Imagined Touch
Generalizable Knowledge Distillation from Vision Foundation Models for Semantic Segmentation
MatLat: Material Latent Space for PBR Texture Generation
Diffusion Guided Chain-of-Vision for Large Autoregressive Vision Models
Mixture of Prototypes for Test-time Adaptive Segmentation
LF-BVN: Blind-View Network for Self-Supervised Light Field Denoising
PMRNet: Physics-informed Multi-scale Refinement Network for Medical Image Segmentation
VGent: Visual Grounding via Modular Design for Disentangling Reasoning and Prediction
Scale Space Diffusion
VGGTracker: Fast Spatial Tracking with Visual Geometry Transformer
DeRVOS: Decoupling Consistent Trajectory Generation and Multimodal Understanding for Referring Video Object Segmentation
Unlocking Motion from Large Vision Models with a Semantic and Kinematic Duality for Gait Recognition
CaricHarmony: Contrastive Diffusion Paths for Identity-Preserving Caricature Synthesis
DiT-IC: Aligned Diffusion Transformer for Efficient Image Compression
Trust-calibrated Collaborative Learning for Long-Tailed Visual Recognition
Edit2Perceive: Image Editing Diffusion Models Are Strong Dense Perceivers
iSHIFT: Lightweight Slow-Fast GUI Agent with Adaptive Perception
Rethinking Glyph Spatial Information in Font Generation
Hierarchical Long Video Understanding with Audiovisual Entity Cohesion and Agentic Search
Diffusion with a Linguistic Compass: Steering the Generation of Clinically Plausible Future sMRI Representations for Early MCI Conversion Prediction
Pose-guided Enriched Feature Learning for Federated-by-camera Person Re-identification
FlowSteer: Guiding Few-Step Image Synthesis with Authentic Trajectories
MoRe: Motion-aware Feed-forward 4D Reconstruction Transformer
CrossVL: Complexity-Aware Feature Routing and Paired Curriculum for Cross-View Vision-Language Detection
2nd Match: Finetuning Pruned Diffusion Models via Second-Order Jacobian Matching
Enhancing Part-Level Point Grounding for Any Open-Source MLLMs
PC-Talk: Precise Facial Animation Control for Audio-Driven Talking Face Generation
REL-SF4PASS: Panoramic Semantic Segmentation with REL Depth Representation and Spherical Fusion
HFedATM: Hierarchical Federated Domain Generalization via Optimal Transport and Regularized Mean Aggregation
V-DPM: Video Reconstruction with Dynamic Point Maps
TeHOR: Text-Guided 3D Human and Object Reconstruction with Textures
Dynamic Logits Adjustment and Exploration for Test-Time Adaptation in Vision Language Models
SGS-Intrinsic: Semantic-Invariant Gaussian Splatting for Sparse-View Indoor Inverse Rendering
ElasticFormer: Detecting Objects in HRW Shots via Elastic Computing Vision Transformer
GEM: Generating LiDAR World Model via Deformable Mamba
Test-Time Alignment of Text-to-Image Diffusion Models via Null-Text Embedding Optimisation
ViTPrompt: Training-Free Prompt Refinement with Visual Tokens for Open-Vocabulary Detection
3D-LATTE: Latent Space 3D Editing from Textual Instructions
Unified Vector Floorplan Generation via Markup Representation
Differentiable Stroke Planning with Dual Parameterization for Efficient and High-Fidelity Painting Creation
Bootstrapping Multi-view Learning for Test-time Noisy Correspondence
RNN as Linear Transformer: A Closer Investigation into Representational Potentials of Visual Mamba Models
EagleNet: Energy-Aware Fine-Grained Relationship Learning Network for Text-Video Retrieval
CaTok: Taming Mean Flows for One-Dimensional Causal Image Tokenization
Beyond Patches: Global-aware Autoregressive Model for Multimodal Few-Shot Font Generation
Generative Point Tracking and Trajectory Forecasting
Brewing Stronger Features: Dual-Teacher Distillation for Multispectral Earth Observation
Preference-Aligned LoRA Merging: Preserving Subspace Coverage and Addressing Directional Anisotropy
CoordSpeaker: Exploiting Gesture Captioning for Coordinated Caption-Empowered Co-Speech Gesture Generation
Low-Rank Residual Diffusion Models
Exploring Visual Pretraining for Learning Language Intelligence
Rethinking Model Selection in VLM Through the Lens of Gromov-Wasserstein Distance
Composing Concepts from Images and Videos via Concept-prompt Binding
Seeing is Improving: Visual Feedback for Iterative Text Layout Refinement
Abstract 3D Perception for Spatial Intelligence in Vision-Language Models
E-3DPSM: A State Machine for Event-based Egocentric 3D Human Pose Estimation
All Vehicles Can Lie: Efficient Adversarial Defense in Fully Untrusted-Vehicle Collaborative Perception via Pseudo-Random Bayesian Inference
What Your Features Reveal: Data-Efficient Black-Box Feature Inversion Attack for Split DNNs
What Is It Like to Be a Noise? An Entropy-based Gaussian Noise Regularization for Diffusion Models
Content-Aware Dynamic Patchification for Efficient Video Diffusion
CLCR: Cross-Level Semantic Collaborative Representation for Multimodal Learning
Expand and Prune: Maximizing Trajectory Diversity for Effective GRPO in Generative Models
MoVieS: Motion-Aware 4D Dynamic View Synthesis in One Second
CoMo: Learning Continuous Latent Motion from Internet Videos for Scalable Robot Learning
Opening the Sim-to-Real Door for Humanoid Pixel-to-Action Policy Transfer
Color-Encoded Illumination for High-Speed Volumetric Scene Reconstruction
MMCP-GEN: A Modality-Extensible Diffusion Language Model for Conditional Protein Sequence Generation
Understanding and Mitigating Hallucinations in Multimodal Chain-of-Thought Models
TimeLens: Rethinking Video Temporal Grounding with Multimodal LLMs
No Calibration, No Depth, No Problem: Cross-Sensor View Synthesis with 3D Consistency
ARGUS: Defending Against Multimodal Indirect Prompt Injection via Steering Instruction-Following Behavior
Evidential Neural Radiance Fields
PanoEnv: Exploring 3D Spatial Intelligence in Panoramic Environments with Reinforcement Learning
Region-Wise Correspondence Prediction between Manga Line Art Images
HBridge: H-Shape Bridging of Heterogeneous Experts for Unified Multimodal Understanding and Generation
CURE: Curriculum-guided Multi-task Training for Reliable Anatomy Grounded Report Generation
Thermal is Always Wild: Characterizing and Addressing Challenges in Thermal-Only Novel View Synthesis
RHINO: Reconstructing Human Interactions with Novel Objects from Monocular Videos
Scalable Object Relation Encoding for Better 3D Spatial Reasoning in Large Language Models
Calibri: Enhancing Diffusion Transformers via Parameter-Efficient Calibration
TSTM: Temporal Segmentation for Task-related Mask in Visual Reinforcement Learning Generalization
HDW-SR: High-Frequency Guided Diffusion Model based on Wavelet Decomposition for Image Super-Resolution
Self-Corrected Image Generation with Explainable Latent Rewards
MeanFlow Transformers with Representation Autoencoders
S2D: Selective Spectral Decay for Quantization Friendly Conditioning of Neural Activations
Virtual Immunohistochemistry Staining with Dual-Aligned Multi-Task Feature Guidance
Rethinking the Semantic-based Autoencoder
Generalized and Personalized Federated Learning with Black-Box Foundation Models via Orthogonal Transformations
MD2E: Modeling Depth-to-Edge Cues for Monocular Metric Depth Estimation
From Failure to Feedback: Group Revision Unlocks Hard Cases in Object-Level Grounding
Weakly Supervised Video Anomaly Detection with Anomaly-Connected Components and Intention Reasoning
When CLIP Sees More, It Fights Back Harder: Multi-View Guided Adaptive Counterattacks for Test-Time Adversarial Robustness
Unlocking Token Rewards via Training-Free Reward Attribution
VCP-Attack: Visual-Contrastive Projection for Transferable Black-Box Targeted Attacks on Large Vision-Language Models
Chart-FR1: Visual Focus-Driven Fine-Grained Reasoning on Dense Charts
A Debiased Reconstruction-based Framework for Training-Free Detection of AI-Generated Images
Spot The Ball: A Benchmark for Visual Social Inference
GA-VLN: Geometry-Aware BEV Representation for Efficient Vision-Language Navigation
FedARA: Resource-adaptive Low-rank Personalized Federated Learning via Anchor-driven Representation Alignment on Heterogeneous Edge Devices
Registration-Free Learnable Multi-View Capture of Faces in Dense Semantic Correspondence
Temporal Inversion for Learning Interval Change in Chest X-Rays
GaussianZoom: Progressive Zoom-in Generative 3D Gaussian Splatting with Geometric and Semantic Guidance
PhysX-Anything: Simulation-Ready Physical 3D Assets from Single Image
EgoControl: Controllable Egocentric Video Generation via 3D Full-Body Poses
Precise Object and Effect Removal with Adaptive Target-Aware Attention
GardenDesigner: Encoding Aesthetic Principles into Jiangnan Garden Construction via a Chain of Agents
FoV-Net: Rotation-Invariant CAD B-rep Learning via Field-of-View Ray Casting
MultiModalPFN: Extending Prior-Data Fitted Networks for Multimodal Tabular Learning
MemFlow: A Lightweight Forward Memorizing Framework for Quick Domain Adaptive Feature Mapping
SGDrive: Scene-to-Goal Hierarchical World Cognition for Autonomous Driving
Event-Based Motion Deblurring Using Task-Oriented 3D Gaussian Event Representations
TaskForce: Cooperative Multi-agent Reinforcement Learning for Multi-task Optimization
Unpaired Deep Image Deraining Using Reward-Guided Self-Reinforcement Learning
AdaIAT: Adaptively Increasing Attention to Generated Text to Alleviate Hallucinations in LVLM
EMR-SM: Explicit Mesh Reconstruction with Dynamic Topology Adaptation
Evaluating Generative Models via One-Dimensional Code Distributions
StreamRAG: Enhancing Real-Time Video Understanding with Retrieval Augmentation
Cluster-aware Anchor Learning for Multi-View Clustering
FlashVGGT: Efficient and Scalable Visual Geometry Transformers with Compressed Descriptor Attention
GDPO-SR: Group Direct Preference Optimization for One-Step Generative Image Super-Resolution
All Roads Lead to Rome: Incentivizing Divergent Thinking in Vision-Language Models
UniAVGen: Unified Audio and Video Generation with Asymmetric Cross-Modal Interactions
MaxMark: High-Capacity Diffusion-Native Watermarking via Robust and Invertible Latent Embedding
Image Diffusion Preview with Consistency Solver
ID-Crafter: VLM-Grounded Online RL for Compositional Multi-Subject Video Generation
Dual Ascent Diffusion for Inverse Problems
OVI-MAP: Open-Vocabulary Instance-Semantic Mapping
MMR-AD: A Large-Scale Multimodal Dataset for Benchmarking General Anomaly Detection with Multimodal Large Language Models
Tunable Soft Equivariance with Guarantees
Learning to Infer Parameterized Representations of Plants from 3D Scans
Dr.Occ: Depth- and Region-Guided 3D Occupancy from Surround-View Cameras for Autonomous Driving
AffordGen: Generating Diverse Demonstrations for Generalizable Object Manipulation with Affordance Correspondence
REALM: An MLLM-Agent Framework for Open World 3D Reasoning Segmentation and Editing on Gaussian Splatting
LATTICE: Democratize High-Fidelity 3D Generation at Scale
Frequency-Aware Flow Matching for High-Quality Image Generation
I2I-Bench: A Comprehensive Benchmark Suite for Image-to-Image Editing Models
Global Prior Meets Local Consistency: Dual-Memory Augmented Vision-Language-Action Model for Efficient Robotic Manipulation
Factorize, Reconstruct, Enhance: A Unified Framework for Multimodal Sentiment Analysis
Bulk RNA-seq Guided Multi-modal Detection of Anomalous Regions in Human Cancer via Spatial Transcriptomics
EditCtrl: Disentangled Local and Global Control for Real-Time Generative Video Editing
Opti-NeuS: Neural Reconstruction for Dual-Layered Transparent and Opaque Objects
Tri-Modal Fusion Transformers for UAV-based Object Detection
Time-Specialized Event-Image Alignment for Blur-to-Video Decomposition
SAGE: Scalable Agentic 3D Scene Generation for Embodied AI
Data Leakage Detection and De-duplication in Large Scale Geospatial Image Datasets
Adaptive Action Chunking at Inference-time for Vision-Language-Action Models
R-4B: Incentivizing General-Purpose Auto-Thinking Capability in MLLMs via Bi-Mode Annealing and Reinforce Learning
AstraNav-Memory: Contexts Compression for Long Memory
Leveraging Verifier-Based Reinforcement Learning in Image Editing
Geo$^\textbf{2}$: Geometry-Guided Cross-view Geo-Localization and Image Synthesis
Compressed-Domain-Aware Online Video Super-Resolution
SafeDrive: Fine-Grained Safety Reasoning for End-to-End Driving in a Sparse World
WildRayZer: Self-supervised Large View Synthesis in Dynamic Environments
TopoHR: Hierarchical Centerline Representation for Cyclic Topology Reasoning in Driving Scenes with Point-to-Instance Relations
Dictionary-Aligned Concept Control for Safeguarding Multimodal LLMs
TAR: Token-Aware Refinement for Fine-grained Generalized Category Discovery
SToRe3D: Sparse Token Relevance in ViTs for Efficient Multi-View 3D Object Detection
AnchorFlow: Training-Free 3D Editing via Latent Anchor-Aligned Flows
R3-PCQA: Ray-Reprojection-Reinforcement for No-Reference 3D Point Cloud Quality Assessment
Depth Any Endoscopy: Towards Self-Supervised Generalizable Depth Estimation in Monocular Endoscopy
Training-free, Perceptually Consistent Low-Resolution Previews with High-Resolution Image for Efficient Workflows of Diffusion Models
ConsID-Gen: View-Consistent and Identity-Preserving Image-to-Video Generation
ArtPro: Self-Supervised Articulated Object Reconstruction with Adaptive Integration of Mobility Proposals
SplitFlux: Learning to Decouple Content and Style from a Single Image
SGDE: Self-supervised Geometry Degradation Estimation Framework for Coded Aperture Compressive Spectral Imaging
Group Diffusion: Enhancing Image Generation by Unlocking Cross-Sample Collaboration
PP-Brep: Few-Shot B-rep Classification with Hybrid Graph Representation
Reconstruction-Guided Slot Curriculum: Addressing Object Over-Fragmentation in Video Object-Centric Learning
Learning Straight Flows: Variational Flow Matching for Efficient Generation
ReLaGS: Relational Language Gaussian Splatting
Learning to Generate via Understanding: Understanding-Driven Intrinsic Rewarding for Unified Multimodal Models
AnomalyVFM -- Transforming Vision Foundation Models into Zero-Shot Anomaly Detectors
WorldGen: From Text to Traversable and Interactive 3D Worlds
VisiLock: Authorizing Instruction-based Image editing with Dual Score Distillation
InstantRetouch: Efficient and High-Fidelity Instruction-Guided Image Retouching with Bilateral Space
Neural Dynamic GI: Random-Access Neural Compression for Temporal Lightmaps in Dynamic Lighting Environments
DreamStereo: Towards Real-Time Stereo Inpainting for HD Videos
Forecasting 3D Scanpaths in Egocentric Video
Mistake Attribution: Fine-Grained Mistake Understanding in Egocentric Videos
Rethinking Prompt Design for Inference-time Scaling in Text-to-Visual Generation
From Attraction to Equilibrium: Physics-Inspired Semantic Gravitons for Zero-Shot Anomaly Detection
Focal–General Diffusion Model with Semantic Consistent Guidance for Sign Language Production
Adaptive Bayesian Early-Exit Networks for Efficient Non-Transferable Learning
WorldStereo: Bridging Controllable Video Generation and Scene Reconstruction via 3D Geometric Memories
InfiniBench: Infinite Benchmarking for Visual Spatial Reasoning with Customizable Scene Complexity
LongVideo-R1: Smart Navigation for Low-cost Long Video Understanding
MAGICIAN: Efficient Long-Term Planning with Imagined Gaussians for Active Mapping
Uika: Universal Head Avatar from Pose-Free Images
Defect Cue-Preserved Structural Feature Refinement for Few-Shot Anomaly Detection
Self-Attention Driven Tensor Representation for High-Order Data Recovery
Learning Effective Sign Features without Text for Gloss-free Sign Language Translation
Prompt-Free Unknown Label Generation for Open World Detection in Remote Sensing
Learning 3D Representations for Spatial Intelligence from Unposed Multi-View Images
InverFill: One-Step Inversion for Enhanced Few-Step Diffusion Inpainting
Bootstrapping Video Semantic Segmentation Model via Distillation-assisted Test-Time Adaptation
High-Precision Dichotomous Image Segmentation via Depth Integrity-Prior and Fine-Grained Patch Strategy
V-Attack: Targeting Disentangled Value Features for Controllable Adversarial Attacks on LVLMs
HumanVBench: Probing Human-Centric Video Understanding in MLLMs with Automatically Synthesized Benchmarks
RoboTAG: End-to-end Robot Pose Estimation via Topological Alignment Graph
Restore, Assess, Repeat: A Unified Framework for Iterative Image Restoration
TokenLight: Precise Lighting Control in Images using Attribute Tokens
The Coherence Trap: When MLLM-Crafted Narratives Exploit Manipulated Visual Contexts
AvatarPointillist: AutoRegressive 4D Gaussian Avatarization
Does YOLO Really Need to See Every Training Image in Every Epoch?
Decoding 3D Perception via BrainSSD: Synergistic Fusion of EEG Representations from Static and Dynamic Visual Streams
CodeV: Code with Images for Faithful Visual Reasoning via Tool-Aware Policy Optimization
PinPoint: Evaluation of Composed Image Retrieval with Explicit Negatives, Multi-Image Queries, and Paraphrase Testing
VLM-Guided Group Preference Alignment for Diffusion-based Human Mesh Recovery
OpenVision 2: A Family of Generative Pretrained Visual Encoders for Multimodal Learning
FLARE: A Failure-Aware Framework for Autonomous Correction and Recovery in Visual-Language Robotic Manipulation
GT-SVJ: Generative-Transformer-Based Self-Supervised Video Judge For Efficient Video Reward Modeling
Reflection Separation from a Single Image via Joint Latent Diffusion
EffectMaker: Unifying Reasoning and Generation for Customized Visual Effect Creation
MultiBanana: A Challenging Benchmark for Multi-Reference Text-to-Image Generation
Hypergraph-State Collaborative Reasoning for Multi-Object Tracking
EgoX: Egocentric Video Generation from a Single Exocentric Video
Rethinking Visual Rearrangement from A Diffusion Perspective
MetricHMSR: Metric Human Mesh and Scene Recovery from Monocular Images
ColorFLUX: A Structure-Color Decoupling Framework for Old Photo Colorization
Deciphering Genotype-Phenotype Mechanisms from High-Content Profiling via Knowledge-Guided Multi-modal Graph Learning
Uncertainty-Aware Modality Fusion for Unaligned RGB-T Salient Object Detection
STCast: Adaptive Boundary Alignment for Global and Regional Weather Forecasting
The Image as Its Own Reward: Reinforcement Learning with Adversarial Reward for Image Generation
Extending Embodied Question Answering from Perception to Decision
Weight Space Representation Learning with Neural Fields
$\texttt{MonoVLM}$: Monocular 3D Visual Grounding with Vision Language Models
DA-VAE: Plug-in Latent Compression for Diffusion via Detail Alignment
Edge-RecViT: Efficient Vision Transformer via Semantic-Refined Dynamic Recursion
LumiMotion: Improving Gaussian Relighting with Scene Dynamics
ODGS-SLAM: Omnidirectional Gaussian Splatting SLAM
Reframing Long-Tailed Learning via Loss Landscape Geometry
CaptionFormer: Unified Segmentation, Tracking, and Captioning for Spatio-Temporal Objects
Thinking with Programming Vision: Towards a Unified View for Thinking with Images
GSV2X: Geometry-Aware Uncertainty Modeling and Orthogonal Fusion for Robust Roadside Perception
SpotEdit: Selective Region Editing in Diffusion Transformers
QuantVLA: Scale-Calibrated Post-Training Quantization for Vision-Language-Action Models
D$^3$FER: Dual Channel and Dual Branch Network for Robust Facial Expression Recognition under Dual Noise
Beyond Global Similarity: Multi-Conditional Retrieval for Fine-Grained Cross-Modal Understanding
Revisiting Token Compression for Accelerating ViT-based Sparse Multi-View 3D Object Detectors
Back to the Feature: Explaining Video Classifiers with Video Counterfactual Explanations
Compositional Transformation Reasoning for Composed Video Retrieval
TacSIm: A Dataset and Benchmark for Football Tactical Style Imitation
AV-Reasoner: Improving and Benchmarking Clue-Grounded Audio-Visual Counting for MLLMs
FALCON: False-Negative Aware Learning of Contrastive Negatives in Vision-Language Alignment
MODIX: A Training-Free Multimodal Information-Driven Positional Index Scaling for Vision-Language Models
OpenDance: Multimodal Controllable 3D Dance Generation with Large-scale Internet Data
LongStream: Long-Sequence Streaming Autoregressive Visual Geometry
ProgTrack: A Multi-Object Tracking Algorithm with Progressive Matching Strategy
VideoMaMa: Mask-Guided Video Matting via Generative Prior
EG-3DVG: Expression and Geometry Aware Grounding Decoder for 3D Visual Grounding
Revisiting Monocular SLAM with Spatio-Temporal Scene Modeling
FlowDIS: Language-Guided Dichotomous Image Segmentation with Flow Matching
Remedying Target-Domain Astigmatism for Cross-Domain Few-Shot Object Detection
TEXTRIX: Latent Attribute Grid for Native Texture Generation and Beyond
OpenFS: Multi-Hand-Capable Fingerspelling Recognition with Implicit Signing-Hand Detection and Frame-Wise Letter-Conditioned Synthesis
AGFT: Alignment-Guided Fine-Tuning for Zero-Shot Adversarial Robustness of Vision-Language Models
Beyond Soft Label: Dataset Distillation via Orthogonal Gradient Matching
E-comIQ-ZH: A Human-Aligned Dataset and Benchmark for Fine-Grained Evaluation of E-commerce Posters with Chain-of-Thought
OMGTex: One-stage Multi-style Facial Texture Reconstruction without Geometry Guidance
FireScope: Wildfire Risk Raster Prediction With a Chain-of-Thought Oracle
Attention, May I Have Your Decision? Localizing Generative Choices in Diffusion Models
Learning Explicit Continuous Motion Representation for Dynamic Gaussian Splatting from Monocular Videos
RAYNOVA: Geometry-Free Auto-Regressive 4D World Modeling with Unified Spatio-Temporal Representation
PhysGaia: A Physics-aware Benchmark with Multi-Body Interactions for Dynamic Novel View Synthesis
Multi-view Consistent 3D Gaussian Head Avatars 'without' Multi-view Generation
Asynchronous Temporal Modeling with Two-Agent Framework for Streaming Dense Video Captioning
Matching Every Pair to Track Every Point: PairFormer for All-Pairs Tracking and Video Trajectory Fields
TM-BSN: Triangular-Masked Blind-Spot Network for Real-World Self-Supervised Image Denoising
Test-Time Instance-Specific Parameter Composition: A New Paradigm for Adaptive Generative Modeling
Self-Critical Distillation Network for Video-based Commonsense Captioning
3D Instance Models are Implicit Generalizable Spatial Learners
Aligning Multi-Character Narrative Image Generation with Multi-Aspect Human Preferences
Interpretable Cross-Domain Few-Shot Learning with Rectified Target-Domain Local Alignment
Structural Action Transformer for 3D Dexterous Manipulation
ViKey: Enhancing Temporal Understanding in Videos via Visual Prompting
Cleaning the Pool: Progressive Filtering of Unlabeled Pools in Deep Active Learning
INSIGHT Bench: Towards Grounded IN-SItu Guidance for Robotic ManipulaTion
SeeLe: A Unified Acceleration Framework for Real-Time Gaussian Splatting on Mobile Devices
CAR-SAM: Cross-Attention Reconstruction for Post-Training Quantization of the Segment Anything Model
AdvFM: Lookahead Flow-Matching Velocity-Field Attacks for Imperceptible and Transferable Adversarial Examples
LangRef3DGS: Natural Language-Guided 3D Referential Segmentation from Partial Observations via 3D Gaussian Splatting
ComPose: A Unified Completion-Pose Framework for Robust Category-Level Object Pose Estimation
SWIFT: Sliding Window Reconstruction for Few-Shot Training-Free Generated Video Attribution
μVLM: A Vision Language Model for μNPUs
Vision Foundation Models Can Be Good Tokenizers for Latent Diffusion Models
AutoMoMa: Scalable Coordinated Mobile Manipulation Trajectory Generation
Talking Together: Synthesizing Co-Located 3D Conversations from Audio
EduDiag: A Benchmark for Educational Diagnostic Reasoning with Error Tracing and Correction on Large Multimodal Models
SEATrack: Simple, Efficient, and Adaptive Multimodal Tracker
PositionIC: Unified Position and Identity Consistency for Image Customization
Back to Source: Open-Set Continual Test-Time Adaptation via Domain Compensation
StreamDiT: Real-Time Streaming Text-to-Video Generation
STAGE: Storyboard-Anchored Generation for Cinematic Multi-shot Narrative
Perceptual-Evidence Anchored Reinforced Learning for Multimodal Reasoning
MUST: Modality-Specific Representation-Aware Transformer for Diffusion-Enhanced Survival Prediction with Missing Modality
Disentangling to Re-couple: Resolving the Similarity-Controllability Paradox in Subject-Driven Text-to-Image Generation
Generalizable Sparse-View 3D Reconstruction from Unconstrained Images
Pointing at Parts: Training-Free Few-Shot Grounding in Multimodal LLMs
Distilling Quasi-Conformal Mapping: A Generalizable and Efficient Solution for Wide-Angle Correction
CF-IPT: Cross-Modal Fusion Interactive Prompt Tuning of Vision-Language Pre-Trained Model for Multisource Remote Sensing Data Classification
Latent Action Pretraining Meets Pose Estimation
AntiStyler: Defending Object Detection Models Against Adversarial Patch Attacks Using Style Removal
GenMask: Adapting DiT for Segmentation via Direct Mask Generation
Fine-Tuning Impairs the Balancedness of Foundation Models in Long-tailed Personalized Federated Learning
Cross-Domain Few-Shot Segmentation via Multi-view Progressive Adaptation
Hyperbolic Defect Feature Synthesis for Few-Shot Defect Classification
Pixel2Phys: Distilling Governing Laws from Visual Dynamics
FinPercep-RM: A Fine-grained Reward Model and Co-evolutionary Curriculum for RL-based Real-world Super-Resolution
ReAttnCLIP: Training-Free Open-Vocabulary Remote Sensing Image Segmentation via Re-defined Attention in CLIP
CanonCGT: Reference-Based Color Grading via Canonical Pivot Representation
PixARMesh: Autoregressive Mesh-Native Single-View Scene Reconstruction
Taming Sampling Perturbations with Variance Expansion Loss for Latent Diffusion Models
Incremental Object Detection via Future-Aware Decoupled Cross-Head Distillation
D2FANet: Enhancing Video Object Detection with Dual-Domain Feature Aggregation Network
EgoFlow: Gradient-Guided Flow Matching for Egocentric 6DoF Object Motion Generation
ReMoGen: Real-time Human Interaction-to-Reaction Generation via Modular Learning from Diverse Data
VirtueBench: Evaluating Trustworthiness under Uncertainty in Long Video Understanding
VideoARM: Agentic Reasoning over Hierarchical Memory for Long-Form Video Understanding
Towards Holistic Modeling for Video Frame Interpolation with Auto-regressive Diffusion Transformers
ActivityForensics: A Comprehensive Benchmark for Localizing Manipulated Activity in Videos
OnlinePG: Online Open-Vocabulary Panoptic Mapping with 3D Gaussian Splatting
MotionMaster: Generalizable Text-Driven Motion Generation and Editing
MaskDiME: Adaptive Masked Diffusion for Precise and Efficient Visual Counterfactual Explanations
Recurrent Reasoning with Vision-Language Models for Estimating Long-Horizon Embodied Task Progress
IsoCLIP: Decomposing CLIP Projectors for Efficient Intra-modal Alignment
MimiCAT: Mimic with Correspondence-Aware Cascade-Transformer for Category-Free 3D Pose Transfer
Gau-Occ: Geometry-Completed Gaussians for Multi-Modal 3D Occupancy Prediction
Illuminating Visual Identity in Universal Multimodal Embeddings
SMRABooth: Subject and Motion Representation Alignment for Customized Video Generation
DecoVLN: Decoupling Observation, Reasoning, and Correction for Vision-and-Language Navigation
One-Shot Flow, Any-Time Frame: A Bidirectional Warping Framework for Event-Based Video Frame Interpolation
Beyond Single-View Sufficiency: CVBench for Cross-View Human Understanding
Flow Matching for Multimodal Distributions
RnG: A Unified Transformer for Complete 3D Modeling from Partial Observations
Learning to Act Robustly with View-Invariant Latent Actions
OralGPT-Plus: Learning to Use Visual Tools via Reinforcement Learning for Panoramic X-ray Analysis
PAS: Prelim Attention Score for Detecting Object Hallucinations in Large Vision-Language Models
HyperNAS: Enhancing Architecture Representation for NAS Predictor via Hypernetwork
Ground Reaction Inertial Poser: Physics-based Human Motion Capture from Sparse IMUs and Insole Pressure Sensors
SinGeo: Unlock Single Model's Potential for Robust Cross-View Geo-Localization
No Hard Negatives Required: Concept Centric Learning Leads to Compositionality without Degrading Zero-shot Capabilities of Contrastive Models
DreamStyle: A Unified Framework for Video Stylization
PointCNN++: Performant Convolution on Native Points
SineProject: Machine Unlearning for Stable Vision–Language Alignment
Flow3r: Factored Flow Prediction for Visual Geometry Learning
QuietPrune: Query-Guided Early Token Pruning for Vision-Language Models
Pixel Motion Diffusion is What We Need for Robot Control
iSplat: Iterative Learning for Fine-Grained Gaussian Splatting
GS-CLIP: Zero-shot 3D Anomaly Detection by Geometry-Aware Prompt and Synergistic View Representation Learning
ViterbiPlanNet: Injecting Procedural Knowledge via Differentiable Viterbi for Planning in Instructional Videos
Prototypical Action Reasoning Facilitated by Vision-Language Alignment for Egocentric Action Anticipation
MotionV2V: Editing Motion in a Video
GGPT: Geometry-Grounded Point Transformer
FedSST: Rethinking Fair Federated Graph Learning under Structural Shift
Stable Spike: Dual Consistency Optimization via Bitwise AND Operations for Spiking Neural Networks
Clothe and Pose
Dual-Granularity Memory for Efficient Video Generation
Video Generation with Stable Transparency via Shiftable RGB-A Distribution Learner
Are Image-to-Video Models Good Zero-Shot Image Editors?
InfinityHuman: Towards Long-Term Audio-Driven Human Animation
Seeing through boxes: Non-Line-of-Sight 3D Reconstruction from Radar Signals
Saliency-Guided Representation with Consistency Policy Learning for Visual Unsupervised Reinforcement Learning
InternVideo-Next: Towards World-Understanding Video Models
What Matters in Practical Learned Image Compression
Multinex: Lightweight Low-light Image Enhancement via Multi-prior Retinex
CARLoS: Retrieval via Concise Assessment Representation of LoRAs at Scale
Follow the Saliency: Supervised Saliency for Retrieval-augmented Dense Video Captioning
HandDreamer: Zero-Shot Text to 3D Hand Model Generation using Corrective Hand Shape Guidance
CausalVAD: De-confounding End-to-End Autonomous Driving via Causal Intervention
Optical Diffraction-based Convolution for Semiconductor Lithography
Selfi: Self Improving Reconstruction Engine via 3D Geometric Feature Alignment
Deconstructing the Failure of Ideal Noise Correction: A Three-Pillar Diagnosis
SV-GS: Sparse View 4D Reconstruction with Skeleton-Driven Gaussian Splatting
Unblur-SLAM: Dense Neural SLAM for Blurry Inputs
3D-VCD: Hallucination Mitigation in 3D-LLM Embodied Agents through Visual Contrastive Decoding
IF-Prune: Information-Flow Guided Token Pruning for Efficient Vision-Language Models
Learning Where to Look and How to Judge: Resolution-agnostic Image Quality Assessment with Quality-aware Saliency
Learning Coordinate-based Convolutional Kernels for Continuous SE(3) Equivariant and Efficient Point Cloud Analysis
Chain-of-Models Pre-training: Rethinking Training Acceleration of CLIP Models
D-Prism: Differentiable Primitives for Structured Dynamic Modeling
HandX+: Scaling Up Text-Conditioned Bimanual Motion Generation
Understanding the Role of Hallucination in Reinforcement Post-Training of Multimodal Reasoning Models
Towards GUI Agents: Vision-Language Diffusion Models for GUI Grounding
Generative Adversarial Perturbations with Cross-paradigm Transferability on Localized Crowd Counting
CGHair: Compact Gaussian Hair Reconstruction with Card Clustering
VideoAutoThink: Video Auto Reasoning via Thinking Once, Answering Twice
EgoAVU: Egocentric Audio-Visual Understanding
UniMERNet: A Universal Network for Real-World Mathematical Expression Recognition
3D-IDE: 3D Implicit Depth Emergent
MMFace-DiT: A Dual-Stream Diffusion Transformer for Multimodal Face Generation
Rascene: High-Fidelity 3D Scene Imaging with mmWave Communication Signals
MoD-DPO: Towards Mitigating Cross-modal Hallucinations in Omni LLMs using Modality Decoupled Preference Optimization
Fast SceneScript: Accurate and Efficient Structured Language Model via Multi-Token Prediction
Lynx: Towards High-Fidelity Personalized Video Generation
When Understanding Becomes a Risk: Authenticity and Safety Risks in the Emerging Image Generation Paradigm
Scaling Spatial and Temporal Context for Robotic Imitation Learning Policies With Scene Graphs
LAMP: Language-Assisted Motion Planning for Controllable Video Generation
EmbodMocap: In-the-Wild 4D Human-Scene Reconstruction for Embodied Agents
Composite-Attribute Person Re-Identification via Pose-Guided Disentanglement
Artiverse: A Diverse and Physically Grounded Dataset for Articulated Objects
Live Interactive Training for Video Segmentation
Neural Differentiation in Deep Networks: A Theoretical Framework for Expressivity and Representational Diversity
Spherical Leech Quantization for Visual Tokenization and Generation
Aesthetic Camera Viewpoint Suggestion with 3D Aesthetic Field
CaT-GS: Efficient 3DGS Rendering for Large Scale Scenes via Inter-frame Caching and Tile Scheduling
Seeing Clearly, Reasoning Confidently: Plug-and-Play Remedies for Vision Language Model Blindness
ReSAM: Refine, Requery, and Reinforce: Self-Prompting Point-Supervised Segmentation for Remote Sensing Images
Ego-STAR: Spatiotemporal Hints for Egocentric Video Understanding
Linear Fundamental Matrix Estimation from 7 or 5 Points
Hermite Radial Basis Function for Surface Reconstruction via Differentiable Rendering
SpaceTools: Tool-Augmented Spatial Reasoning via Double Interactive RL
Learning Convex Decomposition via Feature Fields
X-band Radar Non-Line-of-Sight Imaging
Inter-Photon-Limited Videography
MuM: Multi-View Masked Image Modeling for 3D Vision
Block-Sparse Global Attention for Efficient Multi-View Geometry Transformers
Denoising, Fast and Slow: Difficulty-Aware Adaptive Sampling for Image Generation
Universal Guideline-Driven Image Clustering via a Hybrid LLM Agent
StreamVLO: Streaming Visual–LiDAR Odometry with Cumulative Drift Compensation
ORBIT: Benchmarking SfM in the Wild with 360° Video
MoLingo: Motion–Language Alignment for Text-to-Human Motion Generation
Hyperbolic Gramian Volumes for Multimodal Alignment
MIM Representations Encode Non-Semantic Noise: Post-Hoc Suppression Boosts Zero-Shot Performance
Neu-PiG: Neural Preconditioned Grids for Fast Dynamic Surface Reconstruction on Long Sequences
Omni-Attack: Adversarial Attacks on Open-Ended VQA in Black-Box Multimodal LLMs
Unleashing Stealthy Backdoor Pandemic by Infecting a Single Diffusion Model
OptiMVMap: Offline Vectorized Map Construction via Optimal Multi-vehicle Perspectives
K$\alpha$LOS finds Consensus: A Meta-Algorithm for Evaluating Inter-Annotator Agreement in Complex Vision Tasks
Detecting Unknown Objects via Energy-based Separation for Open World Object Detection
Coverage Optimization for Camera View Selection
From Weights to Concepts: Data-Free Interpretability of CLIP via Singular Vector Decomposition
PhysVid: Physics Aware Local Conditioning for Generative Video Models
From Events to Clarity: The Event-Guided Diffusion Framework for Dehazing
Scan Clusters, Not Pixels: A Cluster-Centric Paradigm for Efficient Ultra-high-definition Image Restoration
From Where Things Are to What They Are For: Benchmarking Spatial–Functional Intelligence in Multimodal LLMs
PromptMoE: A Segmentation Refinement Framework Leveraging Mixture of Experts for Improved Prompting
Towards Foundation Models for 3D Scene Understanding: Instance-Aware Self-Supervised Learning for Point Clouds
In Pursuit of Pixel Supervision for Visual Pre-training
CamDirector: Towards Long-Term Coherent Video Trajectory Editing
MagicQuill V2: Precise and Interactive Image Editing with Layered Visual Cues
ELVIS: Enhance Low-light for Video Instance Segmentation in the Dark
WalkGPT: Grounded Vision–Language Conversation with Depth-Aware Segmentation for Pedestrian Navigation
Synthesizing Visual Concepts as Vision-Language Programs
PRUE: A Practical Recipe for Field Boundary Segmentation at Scale
Push-and-Step: From RL-Based Balance Recovery to Physical Simulation of Dense Crowds
VecGlypher: Unified Vector Glyph Generation with Language Models
Bias Is a Subspace, Not a Coordinate: A Geometric Rethinking of Post-hoc Debiasing in Vision-Language Models
RAVEN: Radar Adaptive Vision Encoders for Efficient Chirp-wise Object Detection and Segmentation
Same or Not? Enhancing Visual Perception in Vision-Language Models
SeeU: Seeing the Unseen World via 4D Dynamics-aware Generation
ReaGEN: Adaptive Generation of Structured Chains-of-Thought for Efficient Multimodal Reasoning
Efficient Decentralized Diffusion with Heterogeneous Training Objectives
Realiz3D: 3D Generation Made Photorealistic via Domain-Aware Learning
SliderEdit: Continuous Image Editing with Fine-Grained Instruction Control
Lite Any Stereo: Efficient Zero-Shot Stereo Matching
From None to All: Self-Supervised 3D Reconstruction via Novel View Synthesis
HalluGen: Synthesizing Realistic and Controllable Hallucinations for Evaluating Image Restoration
SeeThrough3D: Occlusion Aware 3D Control in Text-to-Image Generation
OccuFly: A 3D Vision Benchmark for Semantic Scene Completion from the Aerial Perspective
Unsupervised Multi-agent and Single-agent Perception from Cooperative Views
The Missing GAP: From Solving Square Jigsaw Puzzles To Handling Real World Archaeological Fragments
Thinking in 360°: Humanoid Visual Search in the Wild
VGGT-$\Omega$
EI-Part: Explode for Completion and Implode for Refinement
Same Content, Different Answers: Cross-Modal Inconsistency in MLLMs
Molmo2: Open Weights and Data for Vision-Language Models with Video Understanding and Grounding
TextFM: Robust Semi-dense Feature Matching with Language Guidance
Uni-DAD: Unified Distillation and Adaptation of Diffusion Models for Few-step Few-shot Image Generation
PR-MaGIC: Prompt Refinement Via Mask Decoder Gradient Flow For In-Context Segmentation
Globally Optimal Pose from Silhouettes
GP-4DGS: Probabilistic Analysis of 4D Gaussian Splattings for Monocular Video Reconstruction via Variational Gaussian Processes
Selection-as-Nonlinearity: Bridging Attention and Activation via a Joint Game–Decision Lens for Interpretable, Discriminative Visual Representations
Radiance Meshes for Volumetric Reconstruction
Learning 3D Reconstruction with Priors in Test Time
RAPID: Reusing Attention Sparsity with Inter-step Adaptation for Efficient Video Diffusion
Grounded Latents for Entity-Centric 4D Scene Generation
A Closed-Form Solution for Debiasing Vision-Language Models with Utility Guarantees Across Modalities and Tasks
Grounded 3D-Aware Spatial Vision-Language Modeling
WRIVINDER: Towards Spatial Intelligence for Geo-locating Ground Images onto Satellite Imagery
Decoupling Vision and Language: Codebook Anchored Visual Adaptation
RAP: Fast Feedforward Rendering-Free Attribute-Guided Primitive Importance Score Prediction for Efficient 3D Gaussian Splatting Processing
DRiffusion: Draft-and-Refine Process Parallelizes Diffusion Models with Ease
Is the Modality Gap a Bug or a Feature? A Robustness Perspective
Neural Gabor Splatting: Enhanced Gaussian Splatting with Neural Gabor for High-frequency Surface Reconstruction
DiffSoup: Direct Differentiable Rasterization of Triangle Soup for Extreme Radiance Field Simplification
SAM 3D Body: Robust Full-Body Human Mesh Recovery
SAM 3D: 3Dfy Anything in Images
Continual Learning for fMRI-Based Brain Disorder Diagnosis via Functional Connectivity Matrices Generative Replay