CVPR
2026
Papers
TempoControl: Temporal Attention Guidance for Text-to-Video Models
DLWM: Dual Latent World Models enable Holistic Gaussian-centric Pre-training in Autonomous Driving
Endless World: Real-Time 3D-Aware Long Video Generation
Lenses: Toward Polysemous Vision–Language Understanding
Authorize-on-Demand: Dynamic Authorization with Legality-Aware Intellectual Property Protection for VLMs
GraspALL: Adaptive Structural Compensation from Illumination Variation for Robotic Garment Grasping in Any Low-Light Conditions
DuetSVG: Unified Multimodal SVG Generation with Internal Visual Guidance
Efficient Unrolled Networks for Large-Scale 3D Inverse Problems
Adversarial Style Optimization: Enhancing VLM Jailbreaks by GRPO-based Stylistic Triggers Optimization
ANTS: Adaptive Negative Textual Space Shaping for OOD Detection via Test-Time MLLM Understanding and Reasoning
TEAR: Temporal-aware Automated Red-teaming for Text-to-Video Models
ARGUS: Defending Against Multimodal Indirect Prompt Injection via Steering Instruction-Following Behavior
ViT^3: Unlocking Test-Time Training in Vision
Advancing Image Classification with Discrete Diffusion Classification Modeling
Does YOLO Really Need to See Every Training Image in Every Epoch?
NuWa: Deriving Lightweight Class-Specific Vision Transformers for Edge Devices
Customized Fusion: A Closed-Loop Dynamic Network for Adaptive Multi-Task-Aware Infrared-Visible Image Fusion
ManifoldNeuS: Manifold-aware View Optimizability for Pose-Free Neural Surface Reconstruction
MoVieS: Motion-Aware 4D Dynamic View Synthesis in One Second
Complet4R: Geometric Complete 4D Reconstruction
Learning Compact 3D Representations from Feed-Forward Novel View Synthesis
MetroGS: Efficient and Stable Reconstruction of Geometrically Accurate High-Fidelity Large-Scale Scenes
Flow3r: Factored Flow Prediction for Scalable Visual Geometry Learning
MultiBanana: A Challenging Benchmark for Multi-Reference Text-to-Image Generation
Design Your Ad: Personalized Advertising Image and Text Generation with Unified Autoregressive Models
ConsistCompose: Unified Multimodal Layout Control for Image Composition
A Training-Free Style-Personalization via SVD-Based Feature Decomposition
Beyond Patches: Global-aware Autoregressive Model for Multimodal Few-Shot Font Generation
ImageRAGTurbo: Towards One-step Text-to-Image Generation with Retrieval-Augmented Diffusion Models
FontCrafter: High-Fidelity Element-Driven Artistic Font Creation with Visual In-Context Generation
EmoStyle: Emotion-Driven Image Stylization
IntroSVG: Learning from Rendering Feedback for Text-to-SVG Generation via an Introspective Generator–Critic Framework
Reasoning Diffusion for Unpaired Test Time Out-of-distribution Text-Image to Video Generation
MTA: Multimodal Task Alignment for BEV Perception and Captioning
β-CLIP: Text-Conditioned Contrastive Learning for Multi-Granular Vision-Language Alignment
FALCON: False-Negative Aware Learning of Contrastive Negatives in Vision-Language Alignment
Graph2Eval: Automatic Multimodal Task Generation for Agents via Knowledge Graphs
EvoGraph-R1: Self-Evolving Multimodal Knowledge Hypergraphs for Agentic Retrieval
Cross-modal Identity Mapping: Minimizing Information Loss in Modality Conversion via Reinforcement Learning
Preference-Aligned LoRA Merging: Preserving Subspace Coverage and Addressing Directional Anisotropy
Label-Free Cross-Task LoRA Merging with Null-Space Compression
GeCo: Geometry-Consistent Regularization for Domain Generalized Semantic Segmentation
Event-based Motion Deblurring with Unpaired Data
Event-based Visual Deformation Measurement
Bidirectional Cross-Modal Prompting for Event-Frame Asymmetric Stereo
Adaptive Spatial-Temporal Window: Unlocking the Potential of Event Cameras in Heterogeneous Velocity Scenarios
InternData-A1: Pioneering High-Fidelity Synthetic Data for Pre-training Generalist Policy
GeniNav: Generative Model Driven Image-Goal Navigation via Imagination-Guided Consistency Flow Matching
Cross from Left to Right Brain: Adaptive Text Dreamer for Vision-and-Language Navigation
Arcadia: Toward a Full-Lifecycle Framework for Embodied Lifelong Learning
ORV: 4D Occupancy-centric Robot Video Generation
Omni IIE Bench: Benchmarking the Practical Capabilities of Image Editing Models
Learning Personalized Photographic Style from Pairwise User Preferences
Efficient Weighted Sampling via Score-based Generative Models
FaithFusion: Harmonizing Reconstruction and Generation via Pixel-wise Information Gain
Speeding Up the Learning of 3D Gaussians with Much Shorter Gaussian Lists
DGGT: Feedforward 4D Reconstruction of Dynamic Driving Scenes using Unposed Images
Retrieve-to-Restore: Efficient All-in-One Image Restoration with a Retrieval-Based Degradation Bank
MRI Contrast Enhancement Kinetics World Model
Rethinking Knowledge Transfer in Image Quality Assessment: A Perceptual Preference Structure Alignment Perspective
LF-BVN: Blind-View Network for Self-Supervised Light Field Denoising
Towards Generalized Representations for Low-Light Understanding: When Signal Constancy Meets Semantic Enrichment
Efficient Real-Time Raw-to-Raw Denoising for Extreme Low-Light Ultra HD Video on Mobile Devices
SemiGDA: Generative Dual-distribution Alignment for Semi-Supervised Medical Image Segmentation
Diffusion-Based Native Adversarial Synthesis for Enhanced Medical Segmentation Generalization
Instruction-Guided Lesion Segmentation for Chest X-rays with Automatically Generated Large-Scale Dataset
Towards Human-Imperceptible Backdoor Attacks on Text-to-Image Diffusion Models
DualMirage: Hunting Stealthy Multimodal LLM Agents via CAPTCHAs with Contour and Adversarial Illusions
Models as Lego Builders: Assembling Malice from Benign Blocks via Semantic Blueprints
A Unified Perspective on Adversarial Membership Manipulation in Vision Models
Shedding Light on VLN Robustness: A Black-box Framework for Indoor Lighting-based Adversarial Attack
OddGridBench: Exposing the Lack of Fine-Grained Visual Discrepancy Sensitivity in Multimodal Large Language Models
CLCR: Cross-Level Semantic Collaborative Representation for Multimodal Learning
Learning Anchor in Dual Orthogonal Space for Fast Multi-view Clustering
FAVE: A Structured Benchmark for Fine-Grained Audio-Visual Temporal Evaluation in Multimodal LLMs
Vision-Speech Models: Teaching Speech Models to Converse about Images
Unleashing the Intrinsic Visual Representation Capability of Multimodal Large Language Models
Active Perceptual Inference: A Corticothalamic-Inspired Dynamic Nested Recurrent Network for Multimodal Sentiment Analysis with Incomplete Data
RefAV: Towards Planning-Centric Scenario Mining
UniDex: A Robot Foundation Suite for Universal Dexterous Hand Control from Egocentric Human Videos
ConsID-Gen: View-Consistent and Identity-Preserving Image-to-Video Generation
DiverseGRPO: Mitigating Mode Collapse in Image Generation via Diversity-Aware GRPO
Video Generation with Stable Transparency via Shiftable RGB-A Distribution Learner
MOFA-VTON: More Fashion Possibilities with Fine-Grained Adaptations in Virtual Try-On
NOVA: Sparse Control, Dense Synthesis for Pair-Free Video Editing
Benchmarking Single-Factor Physical Video-to-Audio Generation
EffectErase: Joint Video Object Removal and Insertion for High-Quality Effect Erasing
Clothe and Pose
FlowPortal: Residual-Corrected Flow for Training-Free Video Relighting and Background Replacement
Rethinking UMM Visual Generation: Masked Modeling for Efficient Image-Only Pre-training
Are Image-to-Video Models Good Zero-Shot Image Editors?
FastLightGen: Fast and Light Video Generation with Fewer Steps and Parameters
CASR: A Robust Cyclic Framework for Arbitrary Large-Scale Super-Resolution with Distribution Alignment and Self-Similarity Awareness
Thermal Diffusion Matters: Infrared Spatial-Temporal Video Super-Resolution through Heat Conduction Priors
GDPO-SR: Group Direct Preference Optimization for One-Step Generative Image Super-Resolution
FlashCap: Millisecond-Accurate Human Motion Capture via Flashing LEDs and Event-Based Vision
MatMart: Material Reconstruction of 3D Objects via Diffusion
Region-Adaptive Sampling for Diffusion Transformers
ENC-Bench: A Benchmark for Evaluating Multimodal Large Language Models in Electronic Navigational Chart Understanding
Nonparametric Deep Fine-grained Clustering with Low-Rank Guided Vision-Language Model
PP-OCRv5: A Specialized 5M-Parameter Model Rivaling Billion-Parameter Vision-Language Models on OCR Tasks
HandVQA: Diagnosing and Improving Fine-Grained Spatial Reasoning about Hands in Vision-Language Models
Geoint-R1: Formalizing Multimodal Geometric Reasoning with Dynamic Auxiliary Constructions
Think with 3D: Geometric Imagination Grounded Spatial Reasoning from Limited Views
AffordMatcher: Affordance Learning in 3D Scenes from Visual Signifiers
SpatiaLQA: A Benchmark for Evaluating Spatial Logical Reasoning in Vision-Language Models
Intra-class Distribution-guided Generative Hashing with Neighbor Refinement for Cross-modal Retrieval
MRD: Multi-resolution Retrieval-Detection Fusion for High-Resolution Image Understanding
RetFormer: Multimodal Retrieval for Enhancing Image Recognition
POGA: Paraphrased and Oppositional Graph Alignment for Fine-Grained Cross-Modal Retrieval
Chain-of-Frames: Advancing Video Understanding in Multimodal LLMs via Frame-Aware Reasoning
TempR1: Improving Temporal Understanding of MLLMs via Temporal-Aware Multi-Task Reinforcement Learning
MotionEnhancer: Leveraging Video Diffusion for Motion-Enhanced Vision-Language Models
MedGRPO: Multi-Task Reinforcement Learning for Heterogeneous Medical Video Understanding
Asynchronous Temporal Modeling with Two-Agent Framework for Streaming Dense Video Captioning
OASIS: On-Demand Hierarchical Event Memory for Streaming Video Reasoning
One-Shot Flow, Any-Time Frame: A Bidirectional Warping Framework for Event-Based Video Frame Interpolation
PhaseWin Search Framework Enable Efficient Object-Level Interpretation
Hierarchical Concept Embedding & Pursuit for Interpretable Image Classification
C-LaV: Conditional Latent Velocity Field Denoising for Weather-Robust LiDAR Place Recognition
LiDeRe: A Lightweight Readout for Fast and Data-Efficient Dense Prediction
AnyPcc: Compressing Any Point Cloud with a Single Universal Model
CoLC: Communication-Efficient Collaborative Perception with LiDAR Completion
FoV-Net: Rotation-Invariant CAD B-rep Learning via Field-of-View Ray Casting
Neural Distribution Prior for LiDAR Out-of-Distribution Detection
Cluster-Aware Neural Collapse Prompt Tuning for Long-Tailed Generalization of Vision-Language Models
CAD-Refiner: A Unified Framework for CAD Generation and Iterative Editing
From Measurement to Mitigation: Quantifying and Reducing Identity Leakage in Image Representation Encoders with Linear Subspace Removal
Decoupling Defense Strategies for Robust Image Watermarking
DP-FedAdamW: An Efficient Optimizer for Differentially Private Federated Large Models
Submodel Extraction for Efficient and Personalized Federated Learning via Optimal Transport
FedDAP: Domain-Aware Prototype Learning for Federated Learning under Domain Shift
Variation-aware Vision Token Dropping for Faster Large Vision-Language Models
Fine-Grained Post-Training Quantization for Large Vision Language Models with Quantization-Aware Integrated Gradients
Blink: Dynamic Visual Token Resolution for Enhanced Multimodal Understanding
QuietPrune: Query-Guided Early Token Pruning for Vision-Language Models
The Devil Is in Gradient Entanglement: Energy-Aware Gradient Coordinator for Robust Generalized Category Discovery
Coordinate Denoising for Non-Equilibrium Molecular Representation Learning
Meta-Learning In-Context Enables Training-Free Cross Subject Brain Decoding
Measure The Feature Universe: Topology-based Pseudo Labeling and Gravity Consistency for Source-Free Domain Adaptation
Content-Aware Frequency Encoding for Implicit Neural Representations with Fourier-Chebyshev Features
ActiveAD: Planning-Oriented Active Learning for End-to-End Autonomous Driving
DrivePTS: A Progressive Learning Framework with Textual and Structural Enhancement for Driving Scene Generation
GuideFlow: Constraint-Guided Flow Matching for Planning in End-to-End Autonomous Driving
Consistent Instance Field for Dynamic Scene Understanding
Heuristic Self-Paced Learning for Domain Adaptive Semantic Segmentation under Adverse Conditions
SAM2Text: Towards Prompt-Free and Multi-Resolution Video Scene Text Segmentation
Octopus: History-Free Gradient Orthogonalization for Continual Learning in Multimodal Large Language Models
Quantum-Gated Task-interaction Knowledge Distillation for Pre-trained Model-based Class-Incremental Learning
Elastic Weight Consolidation Done Right for Continual Learning
Soul: Breathe Life into Digital Human for High-fidelity Long-term Multimodal Animation
InfinityHuman: Towards Long-Term Audio-Driven Human Animation
One-to-All Animation: Alignment-Free Character Animation and Image Pose Transfer
AutoTraces: Autoregressive Trajectory Forecasting via Multimodal Large Language Models
Progress-Think: Semantic Progress Reasoning for Vision-Language Navigation
Tavatar: Topology-Aware Gaussian Attribute Derivation for Animatable Human Avatars
PhysHead: Simulation-Ready Gaussian Head Avatars
ReWeaver: Towards Simulation-Ready and Topology-Accurate Garment Reconstruction
Fighting Hallucinations with Counterfactuals: Diffusion-Guided Perturbations for LVLM Hallucination Suppression
Unstitching the Chimera: Frame-Level Risk and Train-Free Mitigation for Video Hallucination
Breaking the Illusion: When Positive Meets Negative in Multimodal Decoding
FlexTraj: Image-to-Video Generation with Flexible Point Trajectory Control
EgoControl: Controllable Egocentric Video Generation via 3D Full-Body Poses
CustomTex: High-fidelity Indoor Scene Texturing via Multi-Reference Customization
Spatial-Frequency Collaborative Learning for Occluded Visible-Infrared Person Re-Identification
Mind the Gap: Transferring Labels to Align Object Detection Datasets
SSM-Aware Token-Efficient VMamba via Adaptive Patch Pruning and Merging for Person Re-Identification
RHCNet: Residual-Guided Hierarchical Calibration Network for Robust Underwater Object Detection
Beyond Duality: A Hybrid Framework of Leveraging Shared and Private Features for RGB-Event Object Detection
AKCMamba-YOLO: Selective State Space Models For Real-Time Object Detection
FVBench: Benchmarking Deepfake Video Detection Capability of Large Multimodal Models
When AVSR Meets Video Conferencing: Dataset, Degradation, and the Hidden Mechanism Behind Performance Collapse
Your One-Stop Solution for AI-Generated Video Detection
HERBench: A Benchmark for Multi-Evidence Integration in Video Question Answering
Seeing the Scene Matters: Revealing Forgetting in Video Understanding Models with a Scene-Aware Long-Video Benchmark
Thinking with Frames: Generative Video Distortion Evaluation via Frame Reward Model
One Model, Many Budgets: Elastic Latent Interfaces for Diffusion Transformers
VMonarch: Efficient Video Diffusion Transformers with Structured Attention
UniPR: Unified Object-level Real-to-Sim Perception and Reconstruction from a Single Stereo Pair
DICArt: Advancing Category-level Articulated Object Pose Estimation in Discrete State-Spaces
MonoSAOD: Monocular 3D Object Detection with Sparsely Annotated Label
SketchVL: Policy Optimization via Fine-Grained Credit Assignment for Chart Understanding and More
SoC: Semantic Orthogonal Calibration for Test-Time Prompt Tuning
Tea-Adapter: Teacher Adapter for Efficient Conditional Generation
SDGS: Spatial Difference Guided Gaussian Splatting for Simultaneous Localization and 3D Reconstruction
RT-Splatting: Joint Reflection-Transmission Modeling with Gaussian Splatting
Distilling Unsigned Distance Function for Surface Reconstruction from 3D Gaussian Splatting
DualSplat: Robust 3D Gaussian Splatting via Pseudo-Mask Bootstrapping from Reconstruction Failures
E2EGS: Event-to-Edge Gaussian Splatting for Pose-Free 3D Reconstruction
DirectFisheye-GS: Enabling Native Fisheye Input in Gaussian Splatting with Cross-View Joint Optimization
VAD-GS: Visibility-Aware Densification for 3D Gaussian Splatting in Dynamic Urban Scenes
RaGS: Unleashing 3D Gaussian Splatting from 4D Radar and Monocular Cue for 3D Object Detection
ActivePolicy: Active Gaussian Reconstruction and Optimization Strategy Based on Global-Local Information Gain
SV-GS: Sparse View 4D Reconstruction with Skeleton-Driven Gaussian Splatting
SparseSplat: Towards Applicable Feed-Forward 3D Gaussian Splatting with Pixel-Unaligned Prediction
REVISOR: Beyond Textual Reflection, Towards Multimodal Introspective Reasoning in Long-Form Video Understanding
StaR-KVQA: Structured Reasoning Traces for Implicit-Knowledge Visual Question Answering
CLiViS: Unleashing Cognitive Map through Linguistic-Visual Synergy for Embodied Visual Reasoning
Proof-of-Perception: Certified Tool-Using Multimodal Reasoning with Compositional Conformal Guarantees
Hugging Visual Prompt and Segmentation Tokens: Consistency Learning for Fine-Grained Visual Understanding in MLLMs
Beyond Depth: Evaluating the Width-centric Reasoning Capability of MLLMs
Perceptual 3D Simulation With Physical World Modeling
CC-VQA: Conflict- and Correlation-Aware Method for Mitigating Knowledge Conflict in Knowledge-Based Visual Question Answering
Multi-Scale Local Speculative Decoding for Image Generation
Beyond Single Solution: Multi-Hypothesis Deep Unfolding Network for Image Compressive Sensing
Discovering Adaptive Task Dependencies for Efficient Multi-Task Representation Compression
Perceptual Neural Video Compression with Color Separation and Rank Chain
GeoFlow: Real-Time Fine-Grained Cross-View Geolocalization via Iterative Flow Prediction
PAUL: Uncertainty-Guided Partition and Augmentation for Robust Cross-View Geo-Localization under Noisy Correspondence
Watch and Learn: Learning to Use Computers from Online Videos
Incentivizing Versatile Video Reasoning in MLLMs via Data-Efficient Reinforcement Learning
Act2See: Emergent Active Visual Perception for Video Reasoning
ViLoMem: Agentic Learner with Grow-and-Refine Multimodal Semantic Memory
ReMoT: Reinforcement Learning with Motion Contrast Triplets
Incentivizing Generative Zero-Shot Learning via Outcome-Reward Reinforcement Learning with Visual Cues
LIFT and PLACE: A Simple, Stable, and Effective Knowledge Distillation Framework for Lightweight Diffusion Models
Uncertainty-Aware Knowledge Distillation for Multimodal Large Language Models
Lens Component Deletion based on Differentiable Ray Tracing
UAVLight: A Benchmark for Illumination-Robust 3D Reconstruction in Unmanned Aerial Vehicle (UAV) Scenes
GFRRN: Explore the Gaps in Single Image Reflection Removal
Efficient All-Pairs Correlation Volume Sampling for Optical Flow Estimation
HyperST: Hierarchical Hyperbolic Learning for Spatial Transcriptomics Prediction
SO(3)-Equivariant ViT-Adapter for Data-Efficient Zero-Shot Sim-to-Real Indoor Panoramic Depth Estimation
XPaintNet: An eXtreme Lightweight Framework for Stereoscopic Conversion without Inpainting Network
LiteSense: Lifting Lightweight ToF with RGB for High-Resolution Metric Depth Estimation
MD2E: Modeling Depth-to-Edge Cues for Monocular Metric Depth Estimation
The Midas Touch for Metric Depth
WonderZoom: Multi-Scale 3D World Generation
PixARMesh: Autoregressive Mesh-Native Single-View Scene Reconstruction
Pano3DComposer: Feed-Forward Compositional 3D Scene Generation from Single Panoramic Image
MeshWeaver: Sparse-Voxel-Guided Surface Weaving for Autoregressive Mesh Generation
Learning Cross-View Object Correspondence via Cycle-Consistent Mask Prediction
CraftMesh: High-Fidelity Generative Mesh Manipulation via Poisson Seamless Fusion
MaskFocus: Focusing Policy Optimization on Critical Steps for Masked Image Generation
GRPO-Guard: Mitigating Implicit Over-Optimization in Flow Matching via Regulated Clipping
The Drift Kernel: Why Diffusion Models Change Even When Told Not To
The Image as Its Own Reward: Reinforcement Learning with Adversarial Reward for Image Generation
Neighbor GRPO: Contrastive ODE Policy Optimization Aligns Flow Models
SMV-EAR: Bring Spatiotemporal Multi-View Representation Learning into Efficient Event-Based Action Recognition
Gamba: Mamba-based graph convolutional network with dynamic graph topology learning for action recognition
Beyond Binary Contrast: Modeling Continuous Skeleton Action Spaces with Transitional Anchors
RoboWheel: A Data Engine from Real-World Human Demonstrations for Cross-Embodiment Robotic Learning
NESTOR: A Nested MOE-based Neural Operator for Large-Scale PDE Pre-Training
Beyond Euclidean Gossip: KL-Barycentric Consensus on Heterogeneous and Imbalanced Images
WebChain: A Large-Scale Human-Annotated Dataset of Real-World Web Interaction Traces
MangoBench: A Benchmark for Multi-Agent Goal-Conditioned Offline Reinforcement Learning
MMBench-GUI: A Unified Hierarchical Evaluation Framework for Multi-Platform GUI Agents
Unlearning without Forgetting: Securely Removing Targeted Concepts from Large-Scale Vision-Language Open-Vocabulary Detectors
S2C2Seg: Semantic-Spatial Consistency and Category Optimization for Open-Vocabulary Segmentation
PromptMoE: A Segmentation Refinement Framework Leveraging Mixture of Experts for Improved Prompting
High-Precision Dichotomous Image Segmentation via Depth Integrity-Prior and Fine-Grained Patch Strategy
GeoSAM2: Unleashing the Power of SAM2 for 3D Part Segmentation
Unlocking 3D Affordance Segmentation with 2D Semantic Knowledge
Real-Time Long Horizon Air Quality Forecasting via Group-Relative Policy Optimization
Beyond Tie Points: Satellite Image Block Adjustment based on Dense Feature Consistency
SARMAE: Masked Autoencoder for SAR Representation Learning
RaPA: Enhancing Transferable Targeted Attacks via Random Parameter Pruning
All Vehicles Can Lie: Efficient Adversarial Defense in Fully Untrusted-Vehicle Collaborative Perception via Pseudo-Random Bayesian Inference
RevINN: An End-to-End Invertible Neural Network for Reversible Adversarial Examples Generation
Scalable Feature Matching via State Space Modeling and Sparse Correlation
Video2Robo: 3DGS-based Synthetic Data from One Video Enables Scalable Robot Learning
ConsisVLA-4D: Advancing Spatiotemporal Consistency in Efficient 3D-Perception and 4D-Reasoning for Robotic Manipulation
AGiLe: Learning Robust Long-Horizon Manipulation via Affordance-Grounded Bidirectional Latent Planning
Learning to Act Robustly with View-Invariant Latent Actions
FlowFM: Advancing Dark Optical Flow Estimation with Flow Matching
No Labels, No Look-Ahead: Unsupervised Online Video Stabilization with Classical Priors
From Detection to Association: Learning Discriminative Object Embeddings for Multi-Object Tracking
Momentum Memory for Knowledge Distillation in Computational Pathology
Joint Spectral Image Reconstruction and Semantic Segmentation with Cooperative Unfolding
X-WIN: Building Chest Radiograph World Model via Predictive Sensing
Tell2Adapt: A Unified Framework for Source Free Unsupervised Domain Adaptation via Vision Foundation Model
TIM: Temporal Decoupling with Iterative Mutual-Refinement Model for Longitudinal Radiology Report Generation
Modeling Spatiotemporal Neural Frames for High Resolution Brain Dynamic
FBTA: Enabling Single-GPU End-to-End Gigapixel WSI Classification with Feature Bridging and Translation Alignment
Shoe Style-Invariant and Ground-Aware Learning for Dense Foot Contact Estimation
MoCapAnything: Unified 3D Motion Capture for Arbitrary Skeletons from Monocular Videos
Gaussian-Mixture Latent Flow for Stochastic 3D Human Motion Prediction
Beyond Single-View Sufficiency: CVBench for Cross-View Human Understanding
PoseGAM: Robust Unseen Object Pose Estimation via Geometry-Aware Multi-View Reasoning
SPARK: Sim-ready Part-level Articulated Reconstruction with VLM Knowledge
Neural Field-Based 3D Surface Reconstruction of Microstructures from Multi-Detector Signals in Scanning Electron Microscopy
PhyGaP: Physically-Grounded Gaussians with Polarization Cues
Proxy-GS: Unified Occlusion Priors for Training and Inference in Structured 3D Gaussian Splatting
Selfi: Self-improving Reconstruction Engine via 3D Geometric Feature Alignment
Z-Order Transformer for Feed-Forward Gaussian Splatting
FUSER: Feed-Forward Multiview 3D Registration Transformer and SE(3)^N Diffusion Refinement
Affostruction: 3D Affordance Grounding with Generative Reconstruction
Unified Primitive Proxies for Structured Shape Completion
ART: Articulated Reconstruction Transformer
S2D: Sparse to Dense Lifting for 3D Reconstruction with Minimal Inputs
QVGGT: Post-Training Quantized Visual Geometry Grounded Transformer
D-Prism: Differentiable Primitives for Structured Dynamic Modeling
STAC: Plug-and-Play Spatio-Temporal Aware Cache Compression for Streaming 3D Reconstruction
LaS-Comp: Zero-shot 3D Completion with Latent–Spatial Consistency
EfficientMonoHair: Fast Strand-Level Reconstruction from Monocular Video via Multi-View Direction Fusion
MoReGen: Multi-Agent Motion-Reasoning Engine for Code-based Text-to-Video Synthesis
GlyphPrinter: Region-Grouped Direct Preference Optimization for Glyph-Accurate Visual Text Rendering
Self-Paced and Self-Corrective Masked Prediction for Movie Trailer Generation
PureCC: Pure Learning for Text-to-Image Concept Customization
Yume1.5: A Text-Controlled Interactive World Generation Model
PROMPTMINER: Black-Box Prompt Stealing against Text-to-Image Generative Models via Reinforcement Learning and VLM-Guided Optimization
FlowDirector: Training-Free Flow Steering for Precise Text-to-Video Editing
LVLM-Aided Alignment of Task-Specific Vision Models
PG-VTON: Single-Pass Training-Free Virtual Try-On via Patch-Guided Reference Alignment
Linguistic Priors for Visual Decoupling: Towards Symmetric Vision-Brain Alignment
FOZO: Forward-Only Zeroth-Order Prompt Optimization for Test-Time Adaptation
AVATAR: Reinforcement Learning to See, Hear, and Reason Over Video
Language Does Matter for Cross-Domain Few-Shot Visual Feature Enhancement
Bridging Domain Expertise and Generalization for Performance Estimation
DA-Mamba: Learning Domain-Aware State Space Model for Global-Local Alignment in Domain Adaptive Object Detection
Scaling Dense Event-Stream Pretraining from Visual Foundation Models
Time-Specialized Event-Image Alignment for Blur-to-Video Decomposition
Unsupervised 3D Motion Estimation Using Event Camera
Goal-Driven Reward by Video Diffusion Models for Reinforcement Learning
AstraNav-Memory: Contexts Compression for Long Memory
Test-Time Perturbation Learning with Delayed Feedback for Vision-Language-Action Models
ShowUI-π: Flow-based Generative Models as GUI Dexterous Hands
ACoT-VLA: Action Chain-of-Thought for Vision-Language-Action Models
SyncMos: Scalable Motion Synchronisation for Multi-Agent Scene Interaction
IF-Bench: Benchmarking and Enhancing MLLMs for Infrared Images with Generative Visual Prompting
MICON-Bench: Benchmarking and Enhancing Multi-Image Context Image Generation in Unified Multimodal Models
The Devil is in Attention Sharing: Improving Complex Non-rigid Image Editing Faithfulness via Attention Synergy
Radiance Meshes for Volumetric Reconstruction
CoRoGS: Contextual Gaussian Splatting for Robust Large-Deviation View Synthesis
Real-Time Dynamic Scene Rendering with Controlled Compressibility and Contact Awareness
Splatent: Splatting Diffusion Latents for Novel View Synthesis
Dynamic-Static Decomposition for Novel View Synthesis of Dynamic Scenes with Spiking Neurons
Residual Diffusion Bridge Model for Image Restoration
Rectifying Latent Space for Generative Single-Image Reflection Removal
Towards Generalized Multimodal Homography Estimation
HG-Lane: High-Fidelity Generation of Lane Scenes under Adverse Weather and Lighting Conditions without Re-annotation
FoundIR-v2: Optimizing Pre-Training Data Mixtures for Image Restoration Foundation Model
InsCal: Calibrated Multi-Source Fully Test-Time Prompt Tuning for Object Detection
Divide, Conquer, and Aggregate: Asymmetric Experts for Class-Imbalanced Semi-Supervised Medical Image Segmentation
Rethinking Box Supervision: Bias-Free Weakly Supervised Medical Segmentation
Semi-supervised Echocardiography Video Segmentation via Anchor Semantic Awareness and Continuous Pseudo-label Reforging
MambaLiteUNet: Cross-Gated Adaptive Feature Fusion for Robust Skin Lesion Segmentation
When LoRA Betrays: Backdooring Text-to-Image Models by Masquerading as Benign Adapters
Breaking Multimodal LLM Safety via Video-Driven Prompting
RecoverMark: Robust Watermarking for Localization and Recovery of Manipulated Faces
PureProof: Diffusion-Resistant Black-box Targeted Attack on Large Vision-Language Models
Rethinking Cross-Modal Anchor Alignment for Mitigating Error Accumulation
Inconsistency-aware Multimodal Schrödinger Bridge for Deepfake Localization
Seeing Through Touch: Tactile-Driven Visual Localization of Material Regions
Seeing What Matters: A Training-Free Self-Guided Framework for Multimodal Detail Perception and Reasoning
The Coherence Trap: When MLLM-Crafted Narratives Exploit Manipulated Visual Contexts
Same Content, Different Answers: Cross-Modal Inconsistency in MLLMs
Tri-Subspaces Disentanglement for Multimodal Sentiment Analysis
Predictive Regularization Against Visual Representation Degradation in Multimodal Large Language Models
The More, the Merrier: Contrastive Fusion for Higher-Order Multimodal Alignment
UST-Hand: An Uncertainty-aware Spatiotemporal Point Cloud Interaction Network for 3D Self-supervised Hand Pose Estimation
Enhancing Hands in 3D Whole-Body Pose Estimation with Conditional Hands Modulator
TouchDream: 3D Object Completion through Imagined Touch
LogCD: Local-to-global Consistency Distillation for Few-step Image Generation
Parallel Jacobi Decoding for Fast Autoregressive Image Generation
CREval: An Automated Interpretable Evaluation for Creative Image Manipulation under Complex Instructions
EchoVDiff: Cardiac-Cycle Echocardiography Video Generation from Arbitrary Frame
Re-Align: Structured Reasoning-guided Alignment for In-Context Image Generation and Editing
Clair Obscur: an Illumination-Aware Method for Real-World Image Vectorization
Lynx: Towards High-Fidelity Personalized Video Generation
First Frame Is the Place to Go for Video Content Customization
Real-Time Generation of Streamable Talking Portrait Video with Reference-Guided Deep Compression VAEs
MultiAnimate: Pose-Guided Image Animation Made Extensible
Translating Signals to Languages for sEMG-Based Activity Recognition
Multi-level Causal LLM-based Text-to-Motion Generation with Human Alignment
LaMoGen: Language to Motion Generation Through LLM-Guided Symbolic Inference
GVIS: Generative Vector Image Steganography
MaxMark: High-Capacity Diffusion-Native Watermarking via Robust and Invertible Latent Embedding
GeoRK2: Geometry-Guided Runge–Kutta Integration for Diffusion Transformer Acceleration
A Self-Conditioned Representation Guided Diffusion Model for Realistic Text-to-LiDAR Scene Generation
ViStoryBench: Comprehensive Benchmark Suite for Story Visualization
R4-CGQA: Retrieval-based Vision Language Models for Computer Graphics Image Quality Assessment
A³: Towards Advertising Aesthetic Assessment
Phrase-Grounding-Aware Supervised Fine-Tuning for Chart Recognition via Side-Masked Attention
VL-RouterBench: A Benchmark for Vision–Language Model Routing
UZ3DVG: Unaided Zero-Shot 3D Visual Grounding with Generated Language Conditions
CLIPoint3D: Language-Grounded Few-Shot Unsupervised 3D Point Cloud Domain Adaptation
GeoTikzBridge: Advancing Multimodal Code Generation for Geometric Perception and Reasoning
Geometry-Guided 3D Visual Token Pruning for Video-Language Models
Context-Nav: Context-Driven Exploration and Viewpoint-Aware 3D Spatial Reasoning for Instance Navigation
PanoEnv: Exploring 3D Spatial Intelligence in Panoramic Environments with Reinforcement Learning
Tackling Alignment Ambiguity in Person Retrieval through Conversational Attribute Mining
What Is the Optimal Ranking Score Between Precision and Recall? We Can Always Find It and It Is Rarely F1
Robust Remote Sensing Image–Text Retrieval with Noisy Correspondence
SPDMark: Selective Parameter Displacement for Robust Video Watermarking
Memory Matters: Boosting Training-Free Zero-Shot Temporal Action Localization with a Learnable Lookup Table
TVHighlights: LLM-Guided Human-Free Collaborative Training for Video Highlight Detection in Movies and TV Dramas
Learning Effective Sign Features without Text for Gloss-free Sign Language Translation
META: Meta Evolution of Tool Trajectory Adaptation for Long-Video Understanding
Align Once to Explain: Feature Alignment for Scalable B-cosification of Foundational Vision Transformers
Rethinking Concept Bottleneck Models: From Pitfalls to Solutions
UniCorrn: Unified Correspondence Transformer Across 2D and 3D
Adapting Point Cloud Analysis via Multimodal Bayesian Distribution Learning
Geometric-Aware Hypergraph Reasoning for Novel Class Discovery in Point Cloud Segmentation
U4D: Uncertainty-Aware 4D World Modeling from LiDAR Sequences
TerraSeg: Self-Supervised Ground Segmentation for Any LiDAR
LLaDA-V: Large Language Diffusion Models with Visual Instruction Tuning
CrossVL: Complexity-Aware Feature Routing and Paired Curriculum for Cross-View Vision-Language Detection
Masking Teacher and Reinforcing Student for Distilling Vision-Language Models
Role-SynthCLIP: A Role-Play Driven Diverse Synthetic Data Approach
PSDesigner: Automated Graphic Design with a Human-Like Creative Workflow
MapRoute: Precise-Concept Erasing Mappers via Semantic Routing
Can We Build Scene Graphs, Not Classify Them? FlowSG: Progressive Image-Conditioned Scene Graph Generation with Flow Matching
Enhancing Visual Representation with Textual Semantics: Textual Semantics-Powered Prototypes for Heterogeneous Federated Learning
FedHarmony: Harmonizing Heterogeneous Label Correlations in Federated Multi-Label Learning
GDFA: Geometry-Driven Federated Unlearning with Directional Task Vector Alignment
InterRVOS: Interaction-Aware Referring Video Object Segmentation
RE-VLM: Event-Augmented Vision-Language Model for Scene Understanding
Learning to Refuse: Refusal-Aware Reinforcement Fine-Tuning for Hard-Irrelevant Queries in Video Temporal Grounding
GroundVTS: Visual Token Sampling in Multimodal Large Language Models for Video Temporal Grounding
TimeLens: Rethinking Video Temporal Grounding with Multimodal LLMs
MeToM: Metadata-Guided Token Merging for Efficient Video LLMs
Mostly Text, Smart Visuals: Asymmetric Text-Visual Pruning for Large Vision-Language Models
Attention-aware Inference Optimizations for Large Vision-Language Models with Memory-efficient Decoding
TAMER: A Tri-Modal Contrastive Alignment and Multi-Scale Embedding Refinement Framework for Zero-Shot ECG Diagnosis
Neural Collapse in Test-Time Adaptation
Neuro-Cognitive Reward Modeling for Human-Centered Autonomous Vehicle Control
E3AD: An Emotion-Aware Vision-Language-Action Model for Human-Centric End-to-End Autonomous Driving
Den-TP: A Density-Balanced Data Curation and Evaluation Framework for Trajectory Prediction
Percept-WAM: Perception-Enhanced World-Awareness-Action Model for Robust End-to-End Autonomous Driving
DriveMoE: Mixture-of-Experts for Vision-Language-Action Model in End-to-End Autonomous Driving
Test-Time Multi-Prompt Adaptation for Open-Vocabulary Remote Sensing Image Segmentation
ReScene4D: Temporally Consistent Semantic Instance Segmentation of Evolving Indoor 3D Scenes
CrackSSM: Reviving SSMs for Crack Segmentation via Dynamic Scanning
Dual-Prototype-Guided Multi-task Learning for Unsupervised Anomaly Detection and Classification
Bootstrapping Video Semantic Segmentation Model via Distillation-assisted Test-Time Adaptation
Parameter-efficient Continual Learning for Enhancing Plasticity without Forgetting under Limited Model Capacity
Continual Distillation of Teachers from Different Domains
Learning from Itself: Mining Internal Knowledge from Vision Language Models for Continual Learning
HAD: Heterogeneity-Aware Distillation for Lifelong Heterogeneous Learning
U-Mind: A Unified Framework for Real-Time Multimodal Interaction with Audiovisual Generation
StreamAvatar: Streaming Diffusion Models for Real-Time Interactive Human Avatars
WildCap: Facial Albedo Capture in the Wild via Hybrid Inverse Rendering
TRM-VLA: Temporal-Aware Chain-of-Thought Reasoning and Memorization for Vision-Language-Action Models
VGGDrive: Empowering Vision-Language Models with Cross-View Geometric Grounding for Autonomous Driving
HTNav: A Hybrid Navigation Framework with Tiered Structure for Urban Aerial Vision-and-Language Navigation
CycleBEV: Regularizing View Transformation Networks via View Cycle Consistency for Bird’s-Eye-View Semantic Segmentation
Globally Optimal Pose from Orthographic Silhouettes
Thinking in Uncertainty: Mitigating Hallucinations in MLRMs with Latent Entropy-Aware Decoding
AdaIAT: Adaptively Increasing Attention to Generated Text to Alleviate Hallucinations in LVLM
HulluEdit: Single-Pass Evidence-Consistent Subspace Editing for Mitigating Hallucinations in Large Vision-Language Models
One Token, Two Fates: A Unified Framework via Vision Token Manipulation Against MLLMs Hallucination
Pantheon360: Taming Digital Twin Generation via 3D-Aware 360° Video Diffusion
Scaling4D: Pushing the Frontier of Video Novel View Synthesis through Large-Scale Monocular Videos
WorldReel: 4D Video Generation with Consistent Geometry and Motion Modeling
Let Your Image Move with Your Motion! -- Implicit Multi-Object Multi-Motion Transfer
D2FANet: Enhancing Video Object Detection with Dual-Domain Feature Aggregation Network
HierUQ: Hierarchical Uncertainty Quantification with Adaptive Granularity Reconciliation for Degraded Image Classification
Hier-COS: Making Deep Features Hierarchy-aware via Composition of Orthogonal Subspaces
COPE: Consistent Occlusion and Prompt Enhancement Network for Occluded Person Re-identification
Assignment-Driven Hash Learning in a Hyper-Semantic Space for On-the-Fly Category Discovery
DyFCLT: Dynamic Frequency-Decoupled Cross-Modal Learning Transformer for Multimodal Tiny Object Detection
Building a Precise Video Language with Human–AI Oversight
CoCoVideo: The High-Quality Commercial-Model-Based Contrastive Benchmark for AI-Generated Video Detection
Towards Sparse Video Understanding and Reasoning
MuKV: Multi-Grained KV Cache Compression for Long Streaming Video Question-Answering
TiViBench: Benchmarking Think-in-Video Reasoning for Video Generation
CFG-Ctrl: Control-Based Classifier-Free Diffusion Guidance
Guiding a Diffusion Transformer with the Internal Dynamics of Itself
CoopDiff: A Diffusion-Guided Approach for Cooperation under Corruptions
COG: Confidence-aware Optimal Geometric Correspondence for Unsupervised Single-reference Novel Object Pose Estimation
Long-SCOPE: Fully Sparse Long-Range Cooperative 3D Perception
Consensus Entropy: Harnessing Multi-VLM Agreement for Self-Verifying and Self-Improving OCR
Beyond Graph Model: Reliable VLM Fine-Tuning via Random Graph Adapter
MiniCPM-V 4.5: Cooking Efficient MLLMs via Architecture, Data, and Training Recipe
Prune Wisely, Reconstruct Sharply: Compact 3D Gaussian Splatting via Adaptive Pruning and Difference-of-Gaussian Primitives
MSCD-GS: Motion-Separated Cooperative Deblurring Dynamic Reconstruction via Gaussian Splatting
iSplat: Iterative Learning for Fine-Grained Gaussian Splatting
MAPo: Motion-Aware Partitioning of Deformable 3D Gaussian Splatting for High-Fidelity Dynamic Scene Reconstruction
FreeArtGS: Articulated Gaussian Splatting Under Free-moving Scenario
SharpTimeGS: Sharp and Stable Dynamic Gaussian Splatting via Lifespan Modulation
4C4D: 4 Camera 4D Gaussian Splatting
SplatSuRe: Selective Super-Resolution for Multi-view Consistent 3D Gaussian Splatting
PRIMU: Uncertainty Estimation for Novel Views in Gaussian Splatting from Primitive-Based Representations of Error and Coverage
Disco-GS: Gaussian Splatting in Dynamic Color Lighting
GuardTrace-VL: Detecting Unsafe Multimodel Reasoning via Iterative Safety Supervision
Will Multimodal Models Be Dazzled by Multi-Image Visual Puzzles?
CARE What Fails: Contrastive Anchored-REflection for Verifiable Multimodal Reasoning
Think-as-You-See: Streaming Chain-of-Thought Reasoning for Large Vision-Language Models
Revisiting the Necessity of Lengthy Chain-of-Thought in Vision-centric Reasoning Generalization
BinaryAttention: One-Bit QK-Attention for Vision and Diffusion Transformers
Ultra-Low Bitrate Perceptual Image Compression with Shallow Encoder
Spk2VidNet: A Hierarchical Recurrent Architecture for High-Fidelity Video Reconstruction from Long Spike-Camera Streams
Adaptive Learned Image Compression with Graph Neural Networks
SGI: Structured 2D Gaussians for Efficient and Compact Large Image Representation
Beyond Perceptual Shortcuts: Causal-Inspired Debiasing Optimization for Generalizable Video Reasoning in Lightweight MLLMs
APPO: Attention-guided Perception Policy Optimization for Video Reasoning
Visual Document Understanding and Reasoning: A Multi-Agent Collaboration Framework with Agent-Wise Adaptive Test-Time Scaling
Bridging Human Evaluation to Infrared and Visible Image Fusion
Beyond Strict Pairing: Arbitrarily Paired Training for High-Performance Infrared and Visible Image Fusion
From Few-way to Many-way: Rethinking Few-shot Fine-grained Image Classification
STiTch: Semantic Transition and Transportation in Collaboration for Training-Free Zero-Shot Composed Image Retrieval
FlowComposer: Composable Flows for Compositional Zero-Shot Learning
CamPI: Physical Adversarial Examples through Camera Power Signal Injection
UniRain: Unified Image Deraining with RAG-based Dataset Distillation and Multi-objective Reweighted Optimization
Differentiable Adaptive 4D Structured Illumination for Joint Capture of Shape and Reflectance
GSNR: Graph Smooth Null-Space Representation for Inverse Problems
αMatte4K & µMatting: Dataset and Model for Ultra-Micro Precision Alpha Video Matting
Dynamic Black-hole Emission Tomography with Physics-informed Neural Fields
Generalizable Radio-Frequency Radiance Fields for Spatial Spectrum Synthesis
GeoRelight: Learning Joint Geometrical Reconstruction and Relighting with Flexible Multi-Modal Diffusion Transformers
Paparazzo: Active Mapping of Moving 3D Objects
SceneScribe-1M: A Large-Scale Video Dataset with Comprehensive Geometric and Semantic Annotations
Omni-3DEdit: Generalized Versatile 3D Editing in One-Pass
Vinedresser3D: Towards Agentic Text-guided 3D Editing
MV2UV: Generating High-quality UV Texture Maps with Multiview Prompts
MeshRipple: Structured Autoregressive Generation of Artist-Meshes
Easy3E: Feed-Forward 3D Asset Editing via Rectified Voxel Flow
CUPID: Generative 3D Reconstruction via Joint Object and Pose Modeling
VA-π: Variational Policy Alignment for Pixel-Aware Autoregressive Generation
SoliReward: Mitigating Susceptibility to Reward Hacking and Annotation Noise in Video Generation Reward Models
AnyID: Ultra-Fidelity Universal Identity-Preserving Video Generation from Any Visual References
Style-GRPO: Semantic-Aware Preference Optimization for Image Style Transfer Guided by Reward Modeling
Spectral Scalpel: Amplifying Adjacent Action Discrepancy via Frequency-Selective Filtering for Skeleton-Based Action Segmentation
DETACH: Decomposed Spatio-Temporal Alignment for Exocentric Video and Ambient Sensors with Staged Learning
Dynamic Momentum Recalibration in Online Gradient Learning
E^2-SCI: Elastic Edge–Cloud Speculative Decoding via Credit Inertia
HyperNAS: Enhancing Architecture Representation for NAS Predictor via Hypernetwork
Spectral Conformal Risk Control: Distribution-Free Tail Guarantees via Bayesian Quadrature
Edge-RecViT: Efficient Vision Transformer via Semantic-Refined Dynamic Recursion
ERMoE: Eigen-Reparameterized Mixture-of-Experts for Stable Routing and Interpretable Specialization
HiconAgent: History Context-aware Policy Optimization for GUI Agents
PET-DINO: Unifying Visual Cues into Grounding DINO with Prompt-Enriched Training
Towards Open-Vocabulary Industrial Defect Understanding with a Large-Scale Multimodal Dataset
Common Inpainted Objects In-N-Out of Context
Hilbert Curve-Based Attention Enabling Topology-Preserving Image Tensor Representation for Semantic Segmentation Network
Towards High-Quality Image Segmentation: Improving Topology Accuracy by Penalizing Neighbor Pixels
Universal 3D Shape Matching via Coarse-to-Fine Language Guidance
Direct Segmentation without Logits Optimization for Training-Free Open-Vocabulary Semantic Segmentation
Discriminative Perception via Anchored Description for Reasoning Segmentation
Best Segmentation Buddies for Image-Shape Correspondence
CrossEarth-Gate: Fisher-Guided Adaptive Tuning Engine for Efficient Adaptation of Cross-Domain Remote Sensing Semantic Segmentation
ACPV-Net: All-Class Polygonal Vectorization for Seamless Vector Map Generation from Aerial Imagery
F2Net: A Frequency-Fused Network for Ultra-High Resolution Remote Sensing Segmentation
RoadGIE: Towards A Global-Scale Aerial Benchmark for Generalizable Interactive Road Extraction
Lipschitz Optimization for Formal Verification of Homographies
Eliminate Distance Differences Induced by Backdoor Attacks: Layer-Selective Training and Clipping to Mask Backdoor Models
Mitigating Error Amplification in Fast Adversarial Training
What Your Features Reveal: Data-Efficient Black-Box Feature Inversion Attack for Split DNNs
Learning to See and Act: Task-Aware Virtual View Exploration for Robotic Manipulation
Visual Sim-to-Real at Scale for Humanoid Loco-Manipulation
UAST: Unified Active Search and Tracking for Arbitrary Targets with UAVs
Cross-Hand Latent Representation for Vision-Language-Action Models
Physically Ground Commonsense Knowledge for Articulated Object Manipulation with Analytic Concepts
From Manuals to Actions: A Unified VLA Model for Chain-of-Thought Manual Generation and Robotic Manipulation
Real-World Point Tracking with Verifier-Guided Pseudo-Labeling
Breaking Smooth-Motion Assumptions: A UAV Benchmark for Multi-Object Tracking in Complex and Adverse Conditions
TrackMAE: Video Representation Learning via Track Mask and Predict
Dual-branch Distilled Transformer for Efficient Asymmetric UAV Tracking
Scaling Self-Supervised and Cross-Modal Pretraining for Volumetric CT Transformers
SemVideo: Reconstructs What You Watch from Brain Activity via Hierarchical Semantic Guidance
Multimodal Causality-Driven Representation Learning for Generalizable Medical Image Segmentation
Beyond the Static-World: Lifelong Learning for All-in-One Medical Image Restoration
RNED: Rotary Number Encoding and Decoding for Quantitative Medical VLM Analysis
MLLM-HWSI: A Multimodal Large Language Model for Hierarchical Whole Slide Image Understanding
BiOTPrompt: Bidirectional Optimal Transport Guided Prompting for Disease Evolution-aware Radiology Report Generation
Learning to See Through a Baby’s Eyes: Early Visual Diets Enable Robust Visual Intelligence in Humans and Machines
Enhancing Accuracy of Uncertainty Estimation in Appearance-based Gaze Tracking with Probabilistic Evaluation and Calibration
SCAPO: Self-Supervised Category-Level Articulated Pose Estimation from a Single 3D Observation
KASALv2: Fully Automatic 3D Rotational Symmetry Classification and Axis Localization
Active Inference for Micro-Gesture Recognition: EFE-Guided Temporal Sampling and Adaptive Learning
ArtPro: Self-Supervised Articulated Object Reconstruction with Adaptive Integration of Mobility Proposals
ResiHMR: Residual-Limb Aware Single-Image 3D Human Mesh Recovery for Individuals with Limb Loss
MimiCAT: Mimic with Correspondence-Aware Cascade-Transformer for Category-Free 3D Pose Transfer
DFD-HR: Generalizable Deepfake Detection via Hierarchical Routing Learning
MGDHand: Multi-Granularity Prior-to-Inertial Distillation Framework for Sequential 3D Hand Pose Estimation from Sparse IMUs
Bézier Degradation Modeling for LiDAR-based Human Motion Capture
Illumination-Consistent Human-Scene Reconstruction from Monocular Video
Agent4FaceForgery: Multi-Agent LLM Framework for Realistic Face Forgery Detection
Enabling Supervised Learning of Generative Signatures for Generalized Synthetic Image Detection
DiffusionFF: A Diffusion-based Framework for Joint Face Forgery Detection and Fine-Grained Artifact Localization
All in One: Unifying Deepfake Detection, Tampering Localization, and Source Tracing with a Robust Landmark-Identity Watermark
Back to Point: Exploring Point-Language Models for Zero-Shot 3D Anomaly Detection
LiDAS: Lighting-driven Dynamic Active Sensing for Nighttime Perception
An Instance-Centric Panoptic Occupancy Prediction Benchmark for Autonomous Driving
OneOcc: Semantic Occupancy Prediction for Legged Robots with a Single Panoramic Camera
Breaking Semantic Boundaries: Distribution-Guided Semantic Exploration for Creative Generation
Guiding a Diffusion Model by Swapping Its Tokens
Streaming Diffusion Model for Fast Infrared and Visible Video Fusion
CoSMo3D: Open-World Promptable 3D Semantic Segmentation through LLM-Guided Canonical Spatial Modeling
GeoViS: Geospatially Rewarded Visual Search for Remote Sensing Visual Grounding
S^2AM3D: Scale-controllable Part Segmentation of 3D Point Clouds
AnchorFlow: Training-Free 3D Editing via Latent Anchor-Aligned Flows
ChordEdit: One-Step Low-Energy Transport for Image Editing
Native and Compact Structured Latents for 3D Generation
MDCS-MoAME: Multi-directional Composite Scanning with Mixture of Attention and Mamba Experts for Cancer Survival Prediction
PAS: A Training-Free Stabilizer for Temporal Encoding in Video LLMs
Mesh4D: 4D Mesh Reconstruction and Tracking from Monocular Video
SMVRT: Implicit Human 3D Modeling Using Sparse Multi-View Volumetric Reconstruction with Transformer Fusion
Any4D: Unified Feed-Forward Metric 4D Reconstruction
AMB3R: Accurate Feed-forward Metric-scale 3D Reconstruction with Backend
Parallelised Differentiable Straightest Geodesics for 3D Meshes
DVGT: Driving Visual Geometry Transformer
Geometry-Aligned and Anomaly-Aware Reconstruction for 3D Anomaly Detection
Hyper-PCN: Hypergraph-Based Point Cloud Completion via High-Order Correlation Modeling
Foundation Encoders Are All You Need for Preference-Aware Personalization
Where Culture Fades: Revealing the Cultural Gap in Text-to-Image Generation
CoLoGen: Progressive Learning of Concept–Localization Duality for Unified Image Generation
Talk2Move: Reinforcement Learning for Text-Instructed Object-Level Geometric Transformation in Scenes
Visual Personalization Turing Test
Semantic Derivative Flow: Graph-Guided Diffusion for Controllable Instance Interactions
Improving Text-to-Image Generation with Intrinsic Self-Confidence Rewards
PromptEnhancer: Taming Your Rewriter for Text-to-Image Generation via Fine-Grained Reward
Copy-Transform-Paste: Zero-Shot Object-Object Alignment Guided by Vision-Language and Geometric Constraints
Gravitation-Driven Semantic Alignment for Text Video Retrieval
M^3KG-RAG: Multi-hop Multimodal Knowledge Graph-enhanced Retrieval-Augmented Generation
Evolutionary Multimodal Reasoning via Hierarchical Semantic Representation for Intent Recognition
PersonaVLM: Long-Term Personalized Multimodal LLMs
Decoupling Stability and Plasticity for Multi-Modal Test-Time Adaptation
CUE: Concept-Aware Multi-Label Expansion to Mitigate Concept Confusion in Long-Tailed Learning
CoFiDA-M: Concept-Aware Feature Modulation for Cross-Domain Adaptation with Image-Only Inference
Reclaiming Lost Text Layers for Source-Free Cross-Domain Few-Shot Learning
AE2VID: Event-based Video Reconstruction via Aperture Modulation
From Contrast to Consistency: Rethinking Event-based Continuous-Time Optical Flow Estimation
Spike-driven Discrete Aggregation for Event-based Object Detection
FloVerse: Floor Plan-Guided Multi-Modal Navigation
Beyond Scanpaths: Graph-Based Gaze Simulation in Dynamic Scenes
Rethinking Visual Rearrangement from A Diffusion Perspective
Bridging the 2D-3D Gap: A Hierarchical Semantic-Geometric Map for Vision Language Navigation
InterAgent: Physics-based Multi-agent Command Execution via Diffusion on Interaction Graphs
RoboAgent: Chaining Basic Capabilities for Embodied Task Planning
Towards Training-free Scene Text Editing
ArtiMuse: Fine-Grained Image Aesthetics Assessment with Joint Scoring and Expert-Level Understanding
I2I-Bench: A Comprehensive Benchmark Suite for Image-to-Image Editing Models
RF4D: Neural Radar Fields for Novel View Synthesis in Outdoor Dynamic Scenes
FluidGaussian: Propagating Simulation-Based Uncertainty Toward Functionally-Intelligent 3D Reconstruction
LagerNVS: Latent Geometry for Fully Neural Real-time Novel View Synthesis
BiProLoRA: Bilevel Prompt LoRA for Real Scene Recovery
CanonCGT: Reference-Based Color Grading via Canonical Pivot Representation
2-Shots in the Dark: Low-Light Denoising with Minimal Data Acquisition
Restore, Assess, Repeat: A Unified Framework for Iterative Image Restoration
Scan Clusters, Not Pixels: A Cluster-Centric Paradigm for Efficient Ultra-high-definition Image Restoration
It Takes Two: A Duet of Periodicity and Directionality for Burst Flicker Removal
Dynamic Exposure Burst Image Restoration
FAPE-IR: Frequency-Aware Planning and Execution Framework for All-in-One Image Restoration
ColorFLUX: A Structure-Color Decoupling Framework for Old Photo Colorization
Uni-Encoder Meets Multi-Encoders: Representation Before Fusion for Brain Tumor Segmentation with Missing Modalities
Dynamic Stream Network for Combinatorial Explosion Problem in Deformable Medical Image Registration
PA-Attack: Guiding Gray-Box Attacks on LVLM Vision Encoders with Prototypes and Attention
When CLIP Sees More, It Fights Back Harder: Multi-View Guided Adaptive Counterattacks for Test-Time Adversarial Robustness
Hidden Dangers of Compositional Generation: Diagnosing Semantic Safety Failures in Text-to-Image Models
GenBreak: Red Teaming Text-to-Image Generation Using Large Language Models
Generate, Analyze, and Refine: Training-Free Sound Source Localization via MLLM Meta-Reasoning
MMCP-GEN: A Modality-Extensible Diffusion Language Model for Conditional Protein Sequence Generation
Few-shot Acoustic Synthesis with Multimodal Flow Matching
Multimodal Protein Language Models for Enzyme Kinetic Parameters: From Substrate Recognition to Conformational Adaptation
PhenoYieldNet: Learning Crop-Aware Phenological Responses for Multi-Crop Yield Prediction
Conflict-Aware Adaptive Cross-Reconstruction for Multimodal Sentiment Analysis
Disentangle-then-Align: Non-Iterative Hybrid Multimodal Image Registration via Cross-Scale Feature Disentanglement
ChartNet: A Million-Scale, High-Quality Multimodal Dataset for Robust Chart Understanding
AffordGrasp: Cross-Modal Diffusion for Affordance-Aware Grasp Synthesis
PAM: A Pose–Appearance–Motion Engine for Sim-to-Real HOI Video Generation
Open-world Hand-Object Interaction Video Generation Based on Structure and Contact-aware Representation
EgoEdit: Dataset, Real-Time Streaming Model, and Benchmark for Egocentric Video Editing
From Inpainting to Layer Decomposition: Repurposing Generative Inpainting Models for Image Layer Decomposition
PROMO: Promptable Outfitting for Efficient High-Fidelity Virtual Try-On
SMRABooth: Subject and Motion Representation Alignment for Customized Video Generation
Plenoptic Video Generation
PyramidalWan: On Making Pretrained Video Model Pyramidal for Efficient Inference
OneStory: Coherent Multi-Shot Video Generation with Adaptive Memory
Linear Image Generation by Synthesizing Exposure Brackets
Low-Resolution Editing is All You Need for High-Resolution Editing
MultiShotMaster: A Controllable Multi-Shot Video Generation Framework
Negative Binomial Variational Autoencoders for Overdispersed Latent Modeling
Training-free Detection of Generated Videos via Spatial-Temporal Likelihoods
Dual Graph Regularized Deep Unfolding Network for Guided Depth Map Super-resolution
DUO-VSR: Dual-Stream Distillation for One-Step Video Super-Resolution
VSRELL: A Simple Baseline for Video Super-Resolution and Enhancement in Low-Light Environment
Toward Real-world Infrared Image Super-Resolution: A Unified Autoregressive Framework and Benchmark Dataset
Next-Scale Autoregressive Models for Text-to-Motion Generation
RoMo: A Large-Scale, Richly Organized Dataset and Semantic Taxonomy for Human Motion Generation
Progressive Guessing to Fixed Point: Rethinking Human Motion Prediction with Deep Equilibrium Models
ReMoGen: Real-time Human Interaction-to-Reaction Generation via Modular Learning from Diverse Data
PatchScene: Patch-based Voxel Diffusion Model for Large-Scale Scene Completion
Prototype-Guided Concept Erasure in Diffusion Models
Do Less, Achieve More: Do We Need Every-Step Optimization for RL Fine-tuning of Diffusion Models?
CSF: Black-box Fingerprinting via Compositional Semantics for Text-to-Image Models
MMTIT-Bench: A Multilingual and Multi-Scenario Benchmark with Cognition–Perception–Reasoning Guided Text-Image Machine Translation
Towards Policy-Adaptive Image Guardrail: Benchmark and Method
TextFM: Robust Semi-dense Feature Matching with Language Guidance
Boosting Document Parsing Efficiency and Performance with Coarse-to-Fine Visual Processing
Beyond 3D VQAs: Injecting 3D Spatial Priors into Vision-Language Models for Enhanced Geometric Reasoning
REALM: An MLLM-Agent Framework for Open World 3D Reasoning Segmentation and Editing on Gaussian Splatting
ReMatch: Boosting Representation through Matching for Multimodal Retrieval
Revisiting F-measure Optimization in Multi-Label Classification: A Sampling-based Approach
Modeling the Visual Ambiguity of Human Sketches
SATTC: Structure-Aware Label-Free Test-Time Calibration for Cross-Subject EEG-to-Image Retrieval
V^2-SAM: Marrying SAM2 with Multi-Prompt Experts for Cross-View Object Correspondence
WeaveTime: Streaming from Earlier Frames into Emergent Memory in VideoLLMs
Efficient Frame Selection for Long Video Understanding via Reinforcement Learning
InternVideo-Next: Towards World-Understanding Video Models
Test-time Ego-Exo-centric Adaptation for Action Anticipation via Multi-Label Prototype Growing and Dual-Clue Consistency
A Stitch in Time: Learning Procedural Workflow via Self-Supervised Plackett–Luce Ranking
Attend Before Attention: Efficient and Scalable Video Understanding via Autoregressive Gazing
Explaining Object Detectors via Collective Contribution of Pixels
Evaluating Generative Models via One-Dimensional Code Distributions
BuildAnyPoint: 3D Building Structured Abstraction from Diverse Point Clouds
Edges Compete for Trust: Group Relative Edge Optimization for Building Reconstruction from Point Clouds
Unsupervised Monocular 3D Keypoint Discovery from Multi-View Diffusion Priors
QD-PCQA: Quality-Aware Domain Adaptation for Point Cloud Quality Assessment
MS^2Gait: A Multi-Scale Spatio-Temporal Fusion Network for LiDAR-based Gait Recognition
Foundry: Distilling 3D Foundation Models for the Edge
Dual-Level Confidence based Implicit Self-Refinement for Medical Visual Question Answering
FedMPT: Federated Multi-Label Prompt Tuning of Vision-Language Models
SynthRGB-T: Language-Vision Guided Image Translation for Diversity Synthesis
Harmonious Parameter Adaptation in Continual Visual Instruction Tuning for Safety-Aligned MLLMs
Vector Prism: Animating Vector Graphics by Stratifying Semantic Structure
Animator-Centric Skeleton Generation on Objects with Fine-Grained Details
Synthesizing Visual Concepts as Vision-Language Programs
Semantic Scale Space: A Framework for Controllable Image Abstraction
Meta-FC: Meta-Learning with Feature Consistency for Robust and Generalizable Watermarking
Neighbor-Aware Localized Concept Erasure in Text-to-Image Diffusion Models
EcoAlign: An Economically Rational Framework for Efficient LVLM Alignment
Domain-Skewed Federated Learning with Feature Decoupling and Calibration
Fine-Tuning Impairs the Balancedness of Foundation Models in Long-tailed Personalized Federated Learning
Few-for-Many Personalized Federated Learning
Domain Sensitive Federated Learning with Fisher-Informed Pruning
SPARROW: Learning Spatial Precision and Temporal Referential Consistency in Pixel-Grounded Video MLLMs
AR²-4FV: Anchored Referring and Re-identification for Long-Term Grounding in Fixed-View Videos
CVA: Context-aware Video-text Alignment for Video Temporal Grounding
ST4R-Splat: Spatio-Temporal Referring Segmentation in 4D Gaussian Splatting
WeMMU: Enhanced Bridging of Vision-Language Models and Diffusion Models via Noisy Query Tokens
Towards Unified Human Perception and Machine Understanding: Token Flow Guided Compression Framework
Unified Spatiotemporal Token Compression for Video-LLMs at Ultra-Low Retention
Tunable Soft Equivariance with Guarantees
Semi-Supervised Conformal Prediction With Unlabeled Nonconformity Score
Revisiting Sparsity Constraint Under High-Rank Property in Partial Multi-Label Learning
Recurrent Video Masked Autoencoders
Revisiting Unknowns: Towards Effective and Efficient Open-Set Active Learning
From Exploration to Exploitation: A Two-Stage Entropy RLVR Approach for Noise-Tolerant MLLM Training
Spatial Retrieval Augmented Autonomous Driving
ColaVLA: Leveraging Cognitive Latent Reasoning for Hierarchical Parallel Trajectory Planning in Autonomous Driving
CARD: A Multi-Modal Automotive Dataset for Dense 3D Reconstruction in Challenging Road Topography
URScenes: A Multi-scenario Dataset for Unstructured Road Environments
MeanFuser: Fast One-Step Multi-Modal Trajectory Generation and Adaptive Reconstruction via MeanFlow for End-to-End Driving
SAMosaic3D: Modular Scene Assembly for Real-Time 3D Segment Anything
MV3DIS: Multi-View Mask Matching via 3D Guides for Zero-Shot 3D Instance Segmentation
MARSS: Radar Semantic Segmentation via Modular Attention and State Space Models
Exemplar-Free Class Incremental Learning via Preserving Class-Discriminative Structure
PACT: Phase-Like Transition Constraints in Adapter-Based Continual Learning of Vision-Language Models
Re-evaluating Continual VQA: Toward Fair and Robust Evaluation for Multimodal Continual Learning
Beyond Myopic Alignment: Lookahead Optimization for Online Class-Incremental Learning
EmoDiffTalk: Emotion-aware Diffusion for Editable 3D Gaussian Talking Head
D^3FER: Dual Channel and Dual Branch Network for Robust Facial Expression Recognition under Dual Challenges
ExpPortrait: Expressive Portrait Generation via Personalized Representation
PersonaLive! Expressive Portrait Image Animation for Live Streaming
ProFocus: Proactive Perception and Focused Reasoning in Vision-and-Language Navigation
CogDriver: Integrating Cognitive Inertia for Temporally Coherent Planning in Autonomous Driving
Zero-Shot Reconstruction of Animatable 3D Avatars with Cloth Dynamics from a Single Image
FlexAvatar: Learning Complete 3D Head Avatars with Partial Supervision
UIKA: Fast Universal Head Avatar from Pose-Free Images
Sparsity as a Key: Unlocking New Insights from Latent Structures for Out-of-Distribution Detection
Envision, Attend, Then Respond: Counterfactual Hallucination Mitigation in Large Vision-Language Models
Generative Video Motion Editing with 3D Point Tracks
BulletTime: Decoupled Control of Time and Camera Pose for Video Generation
Learning to Generate Highly Dynamic Videos using Synthetic Motion Data
Stereo World Model: Camera-Guided Stereo Video Generation
VDFE: Difference-Aware 3D Scene Editing with Non-Intrusive Video Diffusion Priors for Multi-View Consistency and Efficiency
SpatialDiff: 3D-Aware Object Movement via Implicit Spatial Modeling
VLM4RSDet: Collaborative Optimization with Vision-Language Model for Enhancing Remote Sensing Object Detection
Object-Generalized Re-Identification: A Step Towards Universal Instance Perception
When Transformers Meet Mamba: A Hybrid Transformer-Mamba Network for Video Object Detection
Prompt-Anchored Vision–Text Distillation for Lifelong Person Re-identification
Beyond Caption-Based Queries in Video Moment Retrieval
VideoRealBench: A Chain-of-Thought Realism Evaluation Benchmark for Generated Human-Centric Videos
An Empirical Study on How Video-LLMs Answer Video Questions
UniComp: Rethinking Video Compression Through Informational Uniqueness
NaTex: Seamless Texture Generation as Latent Color Diffusion
All-in-One Slider for Attribute Manipulation in Diffusion Models
CATNet: Collaborative Alignment and Transformation Network for Cooperative Perception
Unleashing the Power of Chain-of-Prediction for Monocular 3D Object Detection
Revisiting Token Compression for Accelerating ViT-based Sparse Multi-View 3D Object Detectors
Few-Shot Incremental 3D Object Detection in Dynamic Indoor Environments
Draft and Refine with Visual Experts
Beyond Multiple Choice: Verifiable OpenQA for Robust Vision-Language RFT
See Further, Think Deeper: Advancing VLM's Reasoning Ability with Low-level Visual Cues and Reflection
μVLM: A Vision Language Model for μNPUs
Gaussian Mapping for Evolving Scenes
AnchorSplat: Feed-Forward 3D Gaussian Splatting With 3D Geometric Priors
SGAD-SLAM: Splatting Gaussians at Adjusted Depth for Better Radiance Fields in RGBD SLAM
PhysGS: Bayesian-Inferred Gaussian Splatting for Physical Property Estimation
3D Gaussian Splatting at Arbitrary Resolutions with Compact Proxy Anchors
AeroDGS: Physically Consistent Dynamic Gaussian Splatting for Single-Sequence Aerial 4D Reconstruction
GaussianPile: A Unified Sparse Gaussian Splatting Framework for Slice-based Volumetric Reconstruction
Confidence-Guided Multi-Scale Aggregation for Sparse-View High-Resolution 3D Gaussian Splatting
Robust3DGSW: Toward Robust Watermarking for Quantization-Aware 3D Gaussian Splatting
ParkGaussian: Surround-view 3D Gaussian Splatting for Autonomous Parking
L^2DGS: Low-Light Dynamic Gaussian Splatting
DeepScan: A Training-Free Framework for Visually Grounded Reasoning in Large Vision-Language Models
Locate-Then-Examine: Grounded Region Reasoning Improves Detection of AI-Generated Images
HUMORCHAIN: Theory-Guided Multi-Stage Reasoning for Interpretable Multimodal Humor Generation
Rethinking MLLM Itself as a Segmenter with a Single Segmentation Token
Mario: Multimodal Graph Reasoning with Large Language Models
Rationale-Enhanced Decoding for Multi-modal Chain-of-Thought
Mimic Human Cognition, Master Multi-Image Reasoning: A Meta-Action Framework for Enhanced Visual Understanding
OpenMMReasoner: Pushing the Frontiers in Multimodal Reasoning with an Open and General Recipe
SelecTKD: Selective Token-Weighted Knowledge Distillation for LLMs
AdaRadar: Rate Adaptive Spectral Compression for Radar-based Perception
Precise Object and Effect Removal with Adaptive Target-Aware Attention
Decompose, Mix, Adapt: A Unified Framework for Parameter-Efficient Neural Network Recombination and Compression
FreqSIC: Frequency-aware Stereo Image Compression with Bi-directional Checkerboard Context Model
A2GC: Asymmetric Aggregation with Geometric Constraints for Locally Aggregated Descriptors
Reasoning Palette: Modulating Reasoning via Latent Contextualization for Controllable Exploration for (V)LMs
MSJoE: Jointly Evolving MLLM and Sampler for Efficient Long-Form Video Understanding
Saliency-Guided Representation with Consistency Policy Learning for Visual Unsupervised Reinforcement Learning
DPGF-Net: Dual-Prior Guided Fusion Network for Joint Assessment of Perceptual Quality and Semantic Consistency in AI-Generated Images
RegionFuse: Region-Adaptive Pixel Distribution Learning for Infrared and Visible Image Fusion
Missing No More: Dictionary-Guided Cross-Modal Image Fusion under Missing Infrared
TAPE: Task-Adaptive Prototype Evolution in Audio-Language Models for Fully Few-shot Class-incremental Audio Classification
DDSF: Robust Few-Shot Learning via Disentangled Subspaces with Determinantal Point Process
Hyperbolic Defect Feature Synthesis for Few-Shot Defect Classification
Balanced Dataset Distillation via Modeling Multiple Visual Pattern Distribution
Dataset Distillation by Influence Matching
StableMaterials: Enhancing Diversity in Material Generation via Semi-Supervised Learning
Seeing Through Blur: Tackling Defocus in Spike-Based Imaging
LightRR: A Lightweight Network for Single Image Reflection Removal
HFR and HDR Video from Multi-Attenuated Spikes Using a Rapidly Rotating SpokeND Filter
Coded-E2LF: Coded Aperture Light Field Imaging from Events
Solving Minimal Problems Without Matrix Inversion Using FFT-Based Interpolation
From Spots to Pixels: Dense Spatial Gene Expression Prediction from Histology Images
Cell-Type Prototype-Informed Neural Network for Gene Expression Estimation from Pathology Images
FE2E: From Editor to Dense Geometry Estimator
VGGT-360: Geometry-Consistent Zero-Shot Panoramic Depth Estimation
NI-Tex: Non-isometric Image-based Garment Texture Generation
UniPixie: Unified and Probabilistic 3D Physics Learning via Flow Matching
PartDiffuser: Part-wise 3D Mesh Generation via Discrete Diffusion
Lafite: A Generative Latent Field for 3D Native Texturing
Image-Guided Geometric Stylization of 3D Meshes
LATTICE: Democratize High-Fidelity 3D Generation at Scale
MeshMosaic: Scaling Artist Mesh Generation via Local-to-Global Assembly
TacSIm: A Dataset and Benchmark for Football Tactical Style Imitation
Fine-Grained GRPO for Precise Preference Alignment in Flow Models
Goal Force: Teaching Video Models To Accomplish Physics-Conditioned Goals
Self-Corrected Image Generation with Explainable Latent Rewards
Reading Your Actions: Learning Generalizable Action Representations via Pre-training AEMG
MA-Bench: Towards Fine-grained Micro-Action Understanding
Action Motifs: Self-Supervised Hierarchical Representation of Human Body Movements
DarkShake-DVS: Event-based Human Action Recognition under Low-light and Shaking Camera Conditions
SkeletonContext: Skeleton-side Context Prompt Learning for Zero-Shot Skeleton-based Action Recognition
S^2FT: Parameter-Efficient Fine-Tuning in Sparse Spectrum Domain
Rethinking SNN Online Training and Deployment: Gradient-Coherent Learning via Hybrid-Driven LIF Model
TAS-LoRA: Transformer Architecture Search with Mixture-of-LoRA Experts
AdaBet: Gradient-free Layer Selection for Efficient Training of Deep Neural Networks
Learning to Solve PDEs on Neural Shape Representations
GUI-CEval: A Hierarchical and Comprehensive Chinese Benchmark for Mobile GUI Agents
Decompose and Transfer: CoT-Prompting Enhanced Alignment for Open-Vocabulary Temporal Action Detection
Mitigating Simplicity Bias in OOD Detection through Object Co-occurrence Analysis
Open-Vocabulary Domain Generalization in Urban-Scene Segmentation
ALLNet: Multi-task Dense Prediction for Degraded Images
Geometry-Aware Cross-Modal Graph Alignment for Referring Segmentation in 3D Gaussian Splatting
GenMask: Adapting DiT for Segmentation via Direct Mask Generation
Frequency-Aware Affinity for Weakly Supervised Semantic Segmentation
Beyond Reassembly: Fractured Object Recovery with Missing Parts
RMAE-ProGRess: Advancing Semantic Segmentation in Unstructured Environments
Parallel Rigidity Matters for Bundle Adjustment
GeoDiT: A Diffusion-based Vision-Language Model for Geospatial Understanding
Generative Adversarial Perturbations with Cross-paradigm Transferability on Localized Crowd Counting
Improving Adversarial Transferability with Local Perturbation Augmentation
Stealing Split Learning Bottom Models by Recovering Embedding Geometry
Towards Reliable Evaluation of Adversarial Robustness for Spiking Neural Networks
Where, What, Why: Toward Explainable 3D-GS Watermarking
Adaptive Action Chunking at Inference-time for Vision-Language-Action Models
Localizing, Structuring, and Rendering: Bridging 3D and 2D Vision-Language-Action Models for Robotic Manipulation
EnergyAction: Unimanual to Bimanual Composition with Energy-Based Models
GraspGen-X: Cross-Embodiment 6-DOF Diffusion-based Grasping
UETrack: A Unified and Efficient Framework for Single Object Tracking
Efficient Video Object Segmentation and Tracking with Recurrent Dynamic Submodel
Learning to Track Instance from Single Nature Language Description
Adaptive Depth Lightweight RGB-T Tracking with Holistic Token Routing
UTPTrack: Towards Simple and Unified Token Pruning for Visual Tracking
From Infusion to Assimilation Distillation for Medical Image Segmentation
IBISAgent: Reinforcing Pixel-Level Visual Reasoning in MLLMs for Universal Biomedical Object Referring and Segmentation
Unlocking Positive Transfer in Incrementally Learning Surgical Instruments: A Self-reflection Hierarchical Prompt Framework
Virtual Full-stack Scanning of Brain MRI via Imputing Any Quantised Code
MedLoc-R1: Performance-Aware Curriculum Reward Scheduling for GRPO-Based Medical Visual Grounding
Turning Pre-Trained Vision Transformers into End-to-End Histopathology Whole Slide Image Models for Survival Prediction
KAMP: Knowledge-Anchored Multimodal Pretraining Framework for Medical Image Representation
OmniFM: Toward Modality-Robust and Task-Agnostic Federated Learning for Heterogeneous Medical Imaging
EgoPoseFormer v2: Accurate Egocentric Human Motion Estimation for AR/VR
PoseD-Flow: Versatile and Guided Flow Matching Model of Human Pose
HUMAPS-4D: A Multimodal Dataset for HUman Motion Analysis with Physiological and Semantic informations
LAMP: Localization Aware Multi-camera People Tracking in Metric 3D World
Towards Balanced Multi-Modal Learning in 3D Human Pose Estimation
OMGTex: One-stage Multi-style Facial Texture Reconstruction without Geometry Guidance
Towards Generalizable AI-Generated Image Detection via Image-Adaptive Prompt Learning
SAGA: Source Attribution of Generative AI Videos
Learning from Noisy Supervision: A Denoising–Debiasing Framework for Weakly Supervised Video Anomaly Detection
VisualAD: Language-Free Zero-Shot Anomaly Detection via Vision Transformer
RAID: Retrieval-Augmented Anomaly Detection
QueryOcc: Query-based Self-Supervision for 3D Semantic Occupancy
GSV2X: Geometry-Aware Uncertainty Modeling and Orthogonal Fusion for Robust Roadside Perception
From Pairs to Sequences: Track-Aware Policy Gradients for Keypoint Detection
TGT: Text-Grounded Trajectories for Locally Controlled Video Generation
SoccerMaster: A Vision Foundation Model for Soccer Understanding
AceTone: Bridging Words and Colors for Conditional Image Grading
R^2-Seg: Training-Free OOD Medical Tumor Segmentation via Anatomical Reasoning and Statistical Rejection
VGGT-Segmentor: Geometry-Enhanced Cross-View Segmentation
Sparse–View Localization via Online Neural 3D Regression
Dynamic Visual SLAM using a General 3D Prior
FlashVGGT: Efficient and Scalable Visual Geometry Transformers with Compressed Descriptor Attention
No Calibration, No Depth, No Problem: Cross-Sensor View Synthesis with 3D Consistency
Reliev3R: Relieving Feed-forward 3D Reconstruction from Multi-View Geometric Annotations
Global Structure-from-Motion Meets Feedforward Reconstruction
StoryTailor: A Zero-Shot Pipeline for Action-Rich Multi-Subject Visual Narratives
LumiX: Structured and Coherent Text-to-Intrinsic Generation
OmniGen2: Towards Instruction-Aligned Multimodal Generation
LoFA: Learning to Predict Personalized Prior for Fast Adaptation of Visual Generative Models
FlowFixer: Towards Detail-Preserving Subject-Driven Generation
PointAlign: Feature-Level Alignment Regularization for 3D Vision-Language Models
MoBind: Motion Binding for Fine-Grained IMU–Video Pose Alignment
Tackling Model Bias via Game-theoretic Multi-agent Collaboration Framework for Hateful Meme Classification
Learning to Generate via Understanding: Understanding-Driven Intrinsic Rewarding for Unified Multimodal Models
ARM-Thinker: Reinforcing Multimodal Generative Reward Models with Agentic Tool Use and Visual Reasoning
SG-LoRA: Semantic-guided LoRA Parameters Generation
AcTTA: Rethinking Test-Time Adaptation via Dynamic Activation
Reframing Long-Tailed Learning via Loss Landscape Geometry
DC-Merge: Improving Model Merging with Directional Consistency
NEC-Diff: Noise-Robust Event-RAW Complementary Diffusion for Seeing Motion in Extreme Darkness
Geometric-Photometric Event-based 3D Gaussian Ray Tracing
EventDrive: Event Cameras for Vision-Language Driving Intelligence
MergeVLA: Cross-Skill Model Merging Toward a Generalist Vision-Language-Action Agent
Resolving the Stability-Plasticity Dilemma in Reinforcement Learning via Complementary Continual Critics
Unifying Perception and Action: A Hybrid-Modality Pipeline with Implicit Visual Chain-of-Thought for Robotic Action Generation
FLARE: A Failure-Aware Framework for Autonomous Correction and Recovery in Visual-Language Robotic Manipulation
General Process Reward Modeling for Robotic Reinforcement Learning
DynBridge: Bridging Imagination and Control through Interaction Dynamics for Robot Manipulation
Action-Sketcher: From Reasoning to Action via Visual Sketches for Robotic Manipulation
Cycle-Consistent Tuning for Layered Image Decomposition
NEAF: Natural Image Editing with Attention Fusion for Generalizable Test-time Optimization in Text-Guided Image Editing
Spherical Voronoi: Directional Appearance as a Differentiable Partition of the Sphere
4DSurf: High-Fidelity Dynamic Scene Surface Reconstruction
Depth Peeling for High-Fidelity Gaussian-Enhanced Surfel Rendering
Intrinsic Image Fusion for Multi-View 3D Material Reconstruction
Hybrid Agents for Image Restoration
Self-supervised Dynamic Heterogeneous Degradation Modeling for Unified Zero-Shot Image Restoration
PhaSR: Generalized Image Shadow Removal with Physically Aligned Priors
Disentanglement-wise Image Dehazing through Cross-Domain Manifold Consensus
EchoPOSE: 6D Pose Estimation of Sparse Echocardiograms for Left-Ventricular 3D Shape Reconstruction
LLaDA-MedV: Exploring Large Language Diffusion Models for Biomedical Image Understanding
Harmonized Feature Conditioning and Frequency-Prompt Personalization for Multi-Rater Medical Segmentation
PGR-Net: Prior-Guided ROI Reasoning Network for Brain Tumor MRI Segmentation
Test-Time Attention Purification for Backdoored Large Vision Language Models
Towards Robust Multimodal Large Language Models Against Jailbreak Attacks
When Robots Obey the Patch: Universal Transferable Patch Attacks on Vision-Language-Action Models
Principled Steering via Null-space Projection for Jailbreak Defense in Vision-Language Models
MeteorPred: A Meteorological Multimodal Large Model and Dataset for Severe Weather Event Prediction
YieldSAT: A Multimodal Benchmark Dataset for High-Resolution Crop Yield Prediction
Modeling Cross-vision Synergy for Unified Large Vision Model
Beyond Missing Modalities: Hypergraph Conditioned Diffusion for Uncertainty-Aware Multimodal Emotion Recognition
MOON2.0: Dynamic Modality-balanced Multimodal Representation Learning for E-commerce Product Understanding
AMusE: Audio-Visual Benchmark and Alignment Framework for Agentic Multi-Speaker Understanding
EMAD: Evidence-Centric Grounded Multimodal Diagnosis for Alzheimer’s Disease
Multimodal Learning on Low-Quality Data with Conformal Predictive Self-Calibration
Bootstrap Your Own AV-Proxies: Adaptive Contrastive and Prototype Learning for Audio-Visual Segmentation
Multimodal Distribution Matching for Vision-Language Dataset Distillation
M4-RAG: A Massive-Scale Multilingual Multi-Cultural Multimodal RAG
Real2Edit2Real: Generating Robotic Demonstrations via a 3D Control Interface
Clay-to-Stone: Phase-wise 3D Gaussian Splatting for Monocular Articulated Hand-Object Manipulation Modeling
CaTok: Taming Mean Flows for One-Dimensional Causal Image Tokenization
FFP-300K: Scaling First-Frame Propagation for Generalizable Video Editing
PoseAnything: General Pose-guided Video Generation with Part-aware Temporal Coherence
FastHybrid: Accelerating Hybrid Autoregressive Image Generation with Lookahead and Guided Decoding
AlcheMinT: Fine-grained Temporal Control for Multi-Reference Consistent Video Generation
EVATok: Adaptive Length Video Tokenization for Efficient Visual Autoregressive Generation
Do VLMs Perceive or Recall? Probing Visual Perception vs. Memory with Classic Visual Illusions
Cross-Subject EEG-to-Video Reconstruction and Beyond
Rethinking Position Embedding as a Context Controller for Multi-Reference and Multi-Shot Video Generation
VABench: A Comprehensive Benchmark for Audio-Video Generation
DVAR: Dynamic Visual Autoregressive Modeling for Image Super-Resolution
Gated Condition Injection without Multimodal Attention: Towards Controllable Linear-Attention Transformers
LinVideo: A Post-Training Framework towards O(n) Attention in Efficient Video Generation
EMR-Diff: Edge-aware Multimodal Residual Diffusion Model for Hyperspectral Image Super-resolution
FRAMER: Frequency-Aligned Self-Distillation with Adaptive Modulation Leveraging Diffusion Priors for Real-World Image Super-Resolution
HDW-SR: High-Frequency Guided Diffusion Model based on Wavelet Decomposition for Image Super-Resolution
Unifying Precise Keyframes and Semantic Control via Multi-level Diffusion
Pressure2Motion: Hierarchical Human Motion Reconstruction from Ground Pressure with Text Guidance
MoCoDiff: A Controllable Autoregressive Diffusion Model for Expressive Motion Generation
W2W: Language-Model-Based Trajectory Prediction with Reinforcement Learning
Interact2Ar: Full-Body Human-Human Interaction Generation via Autoregressive Diffusion Models
Unified Number-Free Text-to-Motion Generation Via Flow Matching
FlowPalm: Optical Flow Driven Non-Rigid Deformation for Geometrically Diverse Palmprint Generation
PhyOceanCast: Global Ocean Forecasting with Physics-Informed Diffusion
ORIC: Benchmarking Object Recognition under Contextual Incongruity in Large Vision-Language Models
Towards Real-World Document Parsing via Realistic Scene Synthesis and Document-Aware Training
RoadSceneBench: A Lightweight Benchmark for Mid-Level Road Scene Understanding
VULCAN: Tool-Augmented Multi Agents for Iterative 3D Object Arrangement
Efficient Encoder-Free Fourier-based 3D Large Multimodal Model
Socratic-Geo: Synthetic Data Generation and Cross-Modal Geometric Reasoning via Multi-Agent Interaction
ReLaGS: Relational Language Gaussian Splatting
3D-IDE: 3D Implicit Depth Emergent
LASAR: Towards Spatio-temporal Reasoning with Latent Cognitive Map
Text-Phase Synergy Network with Dual Priors for Unsupervised Cross-Domain Image Retrieval
PIX-TAB: Efficient PIXel-Precise TABle Structure Recognition Approach with Speculative Decoding and Region-Based Image Segmentation
Camouflage-aware Image-Text Retrieval via Expert Collaboration
TIGER: A Unified Framework for Time, Images and Geo-location Retrieval
Stitch-a-Demo: Creating Video Demonstrations from Multistep Descriptions
Prototypical Action Reasoning Facilitated by Vision-Language Alignment for Egocentric Action Anticipation
AdaSpot: Spend Resolution Where It Matters for Precise Event Spotting
Symphony: A Cognitively-Inspired Multi-Agent System for Long-Video Understanding
VideoARM: Agentic Reasoning over Hierarchical Memory for Long-Form Video Understanding
SVAgent: Storyline-guided Long Video Understanding via Cross-Modal Multi-Agent Collaboration
Saliency-R1: Enforcing Interpretable and Faithful Vision-language Reasoning via Saliency-map Alignment Reward
TRANSPORTER: Transferring Visual Semantics from VLM Manifolds
PointCNN++: Performant Convolution on Native Points
PQDT: Pseudo-Query Dual Transformer for Robust Point Cloud Restoration
Test-Time Training for LiDAR Semantic Segmentation under Corruption via Geometric Inlier Discrimination
GEM: Generating LiDAR World Model via Deformable Mamba
Task-Driven Implicit Representations for Automated Design of LiDAR Systems
Hierarchical Point-Patch Fusion with Adaptive Patch Codebook for 3D Shape Anomaly Detection
Dr. Seg: Revisiting GRPO Training for Visual Large Language Models through Perception-Oriented Design
Soft Modality-Guided Expert Specialization in MoE-VLMs
CoVFT: Context-aware Visual Fine-tuning for Multimodal Large Language Models
AutoRegressive Generation with B-rep Holistic Token Sequence Representation
Diagram2Structure: Unlocking LLMs' Diagram Comprehension through DiagramDiff, an Offline Diagram Structuring Framework
ShowTable: Unlocking Creative Table Visualization with Collaborative Reflection and Refinement
ShadowDraw: From Any Object to Shadow-Drawing Compositional Art
End-to-End Hyper-Relational Information Extraction for Engineering Diagrams via Dynamically Tokenized Relation Transformer
Dynamic Token Reweighting for Robust Vision-Language Models
COPYLENS: Towards Copyrighted Characters Infringement Detection via Copyright-Aware Prompt Learning
Closed-Form Concept Erasure via Double Projections
Federated Active Learning Under Extreme Non-IID and Global Class Imbalance
FedRG: Unleashing the Representation Geometry for Federated Learning with Noisy Clients
Fully Decentralized Certified Unlearning
Towards Streaming Referring Video Segmentation via Large Language Model
OmniVTG: A Large-Scale Dataset and Training Paradigm for Open-World Video Temporal Grounding
VideoITG: Multimodal Video Understanding with Instructed Temporal Grounding
DeRVOS: Decoupling Consistent Trajectory Generation and Multimodal Understanding for Referring Video Object Segmentation
UniCompress: Token Compression for Unified Vision–Language Understanding and Generation
VLM-PTQ: Efficient Post-Training Quantization for Large Vision-Language Models
Quant Experts: Token-aware Adaptive Error Reconstruction with Mixture of Experts for Large Vision-Language Models Quantization
Prototype-based Causal Intervention for Multi-Label Image Classification
Evidential Deep Partial Label Learning to Quantify Disambiguation Uncertainty
Unlocking Strong Supervision: A Data-Centric Study of General-Purpose Audio Pre-Training Methods
PAF: Perturbation-Aware Filtering for Open-Set Semi-Supervised Learning
Global-Graph Guided and Local-Graph Weighted Contrastive Learning for Unified Clustering on Incomplete and Noise Multi-View Data
Unposed-to-3D: Learning Simulation-Ready Vehicles from Real-World Images
Perceiving the Near, Reasoning the Distant: Coherent Long-Horizon Trajectory Prediction for Autonomous Driving
Dual-Agent Reinforcement Learning for Adaptive and Cost-Aware Visual–Inertial Odometry
MARIS: Marine Open-Vocabulary Instance Segmentation
XSeg: A Large-scale X-ray Contraband Segmentation Benchmark For Real-World Security Screening
Training-Free Open-Vocabulary Camouflaged Object Segmentation via Fine-Grained Object Binding and Adaptive Hybrid Prompt
ReAttnCLIP: Training-Free Open-Vocabulary Remote Sensing Image Segmentation via Re-defined Attention in CLIP
Mixture of Prototypes for Test-time Adaptive Segmentation
Reconstruction-Guided Slot Curriculum: Addressing Object Over-Fragmentation in Video Object-Centric Learning
Beyond the Static World: Continual Category Discovery under Visual Drift
Memory-Efficient Transfer Learning with Fading Side Networks via Masked Dual Path Distillation
BeautyGRPO: Aesthetic Alignment for Face Retouching via Dynamic Path Guidance and Fine-Grained Preference Modeling
PerformRecast: Expression and Head Pose Disentanglement for Portrait Video Editing
PC-Talk: Precise Facial Animation Control for Audio-Driven Talking Face Generation
FlashPortrait: 6x Faster Infinite Portrait Animation with Adaptive Latent Prediction
DriveVLN: Towards Mapless Vision-and-Language Navigation in Autonomous Driving
Towards Open Environments and Instructions: General Vision-Language Navigation via Fast-Slow Interactive Reasoning
CGHair: Compact Gaussian Hair Reconstruction with Card Clustering
RelightAnyone: A Generalized Relightable 3D Gaussian Head Model
Feed-forward Gaussian Registration for Head Avatar Creation and Editing
Residual Decoding: Mitigating Hallucinations in Large Vision-Language Models via History-Aware Residual Guidance
Understanding the Role of Hallucination in Reinforcement Post-Training of Multimodal Reasoning Models
Gen3R: 3D Scene Generation Meets Feed-Forward Reconstruction
PerpetualWonder: Long-horizon Action-conditioned 4D Scene Generation
RecEdit-Drive: 3D Reconstruction-Guided Spatiotemporal Video Editing for Autonomous Driving Scenes
RigMo: Unifying Rig and Motion Learning for Generative Animation
Text-guided Feature Disentanglement for Cross-modal Gait Recognition
Portable Active Learning for Object Detection
Efficiency Follows Global-Local Decoupling
VRCLIP: Multimodal Canonical Correlation Alignment for CLIP-Driven Vision-Radio Person Re-Identification
Expert-Teacher-Student Collaborative Learning for Domain Adaptive Object Detection
CI-VID: A Coherent Interleaved Text-Video Dataset
GIFT: Global Irreplaceability Frame Targeting for Efficient Video Understanding
Select Less, Reason More: Prioritizing Evidence Purity for Video Reasoning
Ego2Web: A Web Agent Benchmark Grounded in Egocentric Videos
UniVBench: Towards Unified Evaluation for Video Foundation Models
NAMI: Efficient Image Generation via Bridged Progressive Rectified Flow Transformers
TimeRipples: Accelerating vDiTs by Understanding the Spatio-Temporal Correlations in Latent Space
DiT-IC: Aligned Diffusion Transformer for Efficient Image Compression
Efficient and High-Fidelity Omni Modality Retrieval
FlowDC: Flow-Based Decoupling-Decay for Complex Image Editing
High-Fidelity Diffusion Face Swapping with ID-Constrained Facial Conditioning
Spe-BEVHead: Rethinking the Detection Head Design for Bird’s-Eye-View Object Detection
UI-Lens: Assessing General MLLMs’ Potential to Automate UI Display Quality Assurance
Is your VLM Sky-Ready? A Comprehensive Spatial Intelligence Benchmark for UAV Navigation
Linking Perception, Confidence and Accuracy in MLLMs
Learning to Focus and Precise Cropping: A Reinforcement Learning Framework with Information Gaps and Grounding Loss for MLLMs
From Pixel to Precision: Enhancing Handwritten Mathematical Expression Recognition with Image-Level Reward
Seele: A Unified Acceleration Framework for Real-Time Gaussian Splatting on Mobile Devices
GHPT: Real-Time Relightable Gaussian Splatting using Hybrid Path Tracing
PolarGuide-GSDR: 3D Gaussian Splatting Driven by Polarization Priors and Deferred Reflection for Real-World Reflective Scenes
SGS-Intrinsic: Semantic-Invariant Gaussian Splatting for Sparse-View Indoor Inverse Rendering
3D Gaussian Splatting with Self-Constrained Priors for High Fidelity Surface Reconstruction
TWINGS: Thin Plate Splines Warp-aligned Initialization for Sparse-View Gaussian Splatting
VarSplat: Uncertainty-aware 3D Gaussian Splatting for Robust RGB-D SLAM
FastGS: Training 3D Gaussian Splatting in 100 Seconds
BA-GS: Bayesian Adaptive Gaussian Splatting for SFM-Free 3D Reconstruction
ViRC: Enhancing Visual Interleaved Mathematical CoT with Reason Chunking
PixDLM: A Dual-Path Multimodal Language Model for UAV Reasoning Segmentation
When Visualizing is the First Step to Reasoning: MIRA, a Benchmark for Visual Chain-of-Thought
Can a Second-View Image Be a Language? Geometric and Semantic Cross-Modal Reasoning for X-ray Prohibited Item Detection
VCU-Bridge: Hierarchical Visual Connotation Understanding via Semantic Bridging
VOLD: Reasoning Transfer from LLMs to Vision-Language Models via On-Policy Distillation
UVU: Improving Multimodal Understanding via Vision-Language Unified Autoregressive Paradigm
OctoMed: Data Recipes for State-of-the-Art Multimodal Medical Reasoning
Chart-FR1: Visual Focus-Driven Fine-Grained Reasoning on Dense Charts
ApET: Approximation-Error Guided Token Compression for Efficient VLMs
Vision Transformers Need More Than Registers
AdaSVD: Singular Value Decomposition with Adaptive Mechanisms for Large Multimodal Models
ReFTA: Breaking the Weight Reconstruction Bottleneck in Tensorized Parameter-Efficient Fine-Tuning
PlanaReLoc: Camera Relocalization in 3D Planar Primitives via Region-Based Structure Matching
MOGeo: Beyond One-to-One Cross-View Object Geo-localization
AsymLoc: Towards Asymmetric Feature Matching for Efficient Visual Localization
Asking like Socrates: Socrates helps VLMs understand remote sensing images
GTR-Turbo: Merged Checkpoint is Secretly a Free Teacher for Agentic VLM Training
Let VLMs Grade Their Own Thoughts: A Self-Quantification Approach to Reasoning-Aware Reward Modeling
SciEducator: Scientific Video Understanding and Educating via Deming-Cycle Multi-Agent System
VideoSSR: Video Self-Supervised Reinforcement Learning
Neurodynamics-Driven Coupled Neural P Systems for Multi-Focus Image Fusion
MagicFuse: Single Image Fusion for Visual and Semantic Reinforcement
Human-Centric Multi-Exposure Fusion: Benchmark and Bi-level Cognition Distillation Framework
NAF: Zero-Shot Feature Upsampling via Neighborhood Attention Filtering
Unsafe2Safe: Controllable Image Anonymization for Downstream Utility
Uni-DAD: Unified Distillation and Adaptation of Diffusion Models for Few-step Few-shot Image Generation
Streamlined Knowledge Distillation
IMS3: Breaking Distributional Aggregation in Diffusion-Based Dataset Distillation
240FPS Stereo Vision from Monocular Mixed Spikes
Self-Diffusion Driven Blind Imaging
Differentiable Stroke Planning with Dual Parameterization for Efficient and High-Fidelity Painting Creation
Towards Universal Computational Aberration Correction in Photographic Cameras: A Comprehensive Benchmark Analysis
Multi-View Hierarchical Alignment Learning for Spatial Transcriptomics
TRIDENT: A Trimodal Cascade Generative Framework for Drug and RNA-Conditioned Cellular Morphology Synthesis
OrienPose: Orientation-Guided Novel View Synthesis for Single-Image Unseen Object Pose Estimation
Depth Any Panoramas: A Foundation Model for Panoramic Depth Estimation
Seeing Depth Through Frequency and Motion: A Progressive Training Paradigm for Monocular Depth Estimation
GeoGuide: Hierarchical Geometric Guidance for Open-Vocabulary 3D Semantic Segmentation
PE3R: Perception-Efficient 3D Reconstruction
Iris: Bringing Real-World Priors into Diffusion Model for Monocular Depth Estimation
AirSim360: A Panoramic Simulation Platform within Drone View
Radar-Guided Polynomial Fitting for Metric Depth Estimation
REVIVE 3D: Refinement via Encoded Voluminous Inflated prior for Volume Enhancement
MorphAny3D: Unleashing the Power of Structured Latent in 3D Morphing
FlashMesh: Faster and Better Autoregressive Mesh Synthesis via Structured Speculation
X-Part: High Fidelity And Structure Coherent Shape Decomposition And Completion
Nestwork: Conditional 3D Furnished House Layout Generation through Latent Heterogeneous Graph Diffusion
Beyond Geometry: Artistic Disparity Synthesis for Immersive 2D-to-3D
ExMesh: EXplicit Mesh Reconstruction with Topology Adaptation
3DrawAgent: Teaching LLM to Draw in 3D with Early Contrastive Experience
Sculpt4D: Generating 4D Shapes via Sparse-Attention Diffusion Transformers
PhysGen: Physically Grounded 3D Shape Generation for Industrial Design
Enhancing Spatial Understanding in Image Generation via Reward Modeling
Identity-Preserving Image-to-Video Generation via Reward-Guided Optimization
JarvisEvo: Towards a Self-Evolving Photo Editing Agent with Synergistic Editor-Evaluator Optimization
LAOF: Robust Latent Action Learning with Optical Flow Constraints
DarkAct: A RGB-Thermal Dataset and Fusion Framework for Multimodal Low-Light Action Recognition
Deep Feature Deformation Weights
Resolving Endpoint Underfitting in Diffusion Bridges via Noise Alignment
RNN as Linear Transformer: A Closer Investigation into Representational Potentials of Visual Mamba Models
Coupling Liquid Time‑Constant Encoders with Modern Hopfield Memory
Stronger Normalization-Free Transformers
Convolutional Neural Networks Driven by Content Similarity
HATS: Hardness-Aware Trajectory Synthesis for GUI Agents
ProactiveMobile: A Comprehensive Benchmark for Boosting Proactive Intelligence On Mobile Devices
OS-Oracle: A Comprehensive Framework for Cross-Platform GUI Critic Models
See, Think, Act: Teaching Multimodal Agents to Effectively Interact with GUI by Identifying Toggles
Beyond Weak Supervision: MLLMs-Guided Graded Knowledge Distillation for Unsupervised Camouflaged Object Detection
TTL: Test-time Textual Learning for OOD Detection with Pretrained Vision-Language Models
SRA-Det: Learning Omni-Grained Open-Vocabulary Detection Beyond Category Names
PCA-Seg: Revisiting Cost Aggregation for Open-Vocabulary Semantic and Part Segmentation
Seeing Both Sides: Towards Bidirectional Semantic Alignment for Open-Vocabulary Camouflaged Object Segmentation
REL-SF4PASS: Panoramic Semantic Segmentation with REL Depth Representation and Spherical Fusion
From Softmax to Dirichlet: Evidential Learning for Semi-supervised Semantic Segmentation
Shape-of-You: Fused Gromov-Wasserstein Optimal Transport for Semantic Correspondence in-the-Wild
HOPS: Hierarchical Open-vocabulary Part Segmentation with Attention-Aware Filtering and Affinity-Guided Enhancement
MEMO: Human-like Crisp Edge Detection Using Masked Edge Prediction
Seeing Through the Noise: Improving Infrared Small Target Detection and Segmentation from Noise Suppression Perspective
GeoSANE: Learning Geospatial Representations from Models, Not Data
Brewing Stronger Features: Dual-Teacher Distillation for Multispectral Earth Observation
RAMEN: Resolution-Adjustable Multimodal Encoder for Earth Observation
SEBA: Sample-Efficient Black-Box Attacks on Visual Reinforcement Learning
Phantom: Physical Object Interactions as Dynamic Triggers for NMS-Exploited Backdoors
Defending Unauthorized Model Merging via Dual-Stage Weight Protection
On the Role of Temporal Granularity in the Robustness of Spiking Neural Networks
LacTokGen: Latent Consistency Tokenizer for 1024-pixel Image Generation by 256 Tokens
Efficient Hybrid SE(3)-Equivariant Visuomotor Flow Policy via Spherical Harmonics for Robot Manipulation
TSTM: Temporal Segmentation for Task-relevant Mask in Visual Reinforcement Learning Generalization
GraspLDP: Towards Generalizable Grasping Policy via Latent Diffusion
A Cross-view Fusion Framework for Robust 6-DoF Grasp Pose Estimation
SAVA-X: Ego-to-Exo Imitation Error Detection via Scene-Adaptive View Alignment and Bidirectional Cross View Fusion
Gallant: Voxel Grid-based Humanoid Locomotion and Local-navigation across 3-D Constrained Terrains
IGen: Scalable Data Generation for Robot Learning from Open-World Images
TGTrack: Temporal Generative Learning for Unified Single Object Tracking
GeoMotion: Rethinking Motion Segmentation via Latent 4D Geometry
RAGTrack: Language-aware RGBT Tracking with Retrieval-Augmented Generation
Dual-level Adaptation for Multi-Object Tracking: Building Test-Time Calibration from Experience and Intuition
GMT: Effective Global Framework for Multi-Target Multi-Camera Tracking
DARC: Dual Adjustment Reasoning with Counterfactuals for Trustworthy Chest X-ray Classification
Focus-to-Perceive Representation Learning: A Cognition-Inspired Hierarchical Framework for Endoscopic Video Analysis
Temporal Inversion for Learning Interval Change in Chest X-Rays
JUMP-Hand: Learning Joint-wise Uncertainty to Gate Mixture of View Experts for Multi-View 3D Hand Reconstruction
PAD-Hand: Physics-Aware Diffusion for Hand Motion Recovery
Anatomical Domain Shifts: Test-time Heterogeneous Adaptation for 3D Human Pose Prediction
Learning 3D Shape Fidelity Metric from Real-world Distortions
BarbieGait: An Identity-Consistent Synthetic Human Dataset with Versatile Cloth-Changing for Gait Recognition
FisherPoser: Human Motion Estimation from Sparse Observations with Hierarchical Region-Wise Fisher-Matrix Uncertainty Modeling
VIMCAN: Visual-Inertial 3D Human Pose Estimation with Hybrid Mamba-Cross-Attention Network
Bringing Your Portrait to 3D Presence
One-to-More: High-Fidelity Training-Free Anomaly Generation with Attention Control
UniMMAD: Unified Multi-Modal and Multi-Class Anomaly Detection via MoE-Driven Feature Decompression
Multi-Prototype Compactness and Boundary-Aware Synthesis for Unsupervised Anomaly Detection
PDD: Manifold-Prior Diverse Distillation for Medical Anomaly Detection
Learning Spatial-Temporal Consistency for 3D Semantic Scene Completion
OccAny: Generalized Unconstrained Urban 3D Occupancy
Dr.Occ: Depth- and Region-Guided 3D Occupancy from Surround-View Cameras for Autonomous Driving
ShelfOcc: Native 3D Supervision beyond LiDAR for Vision-Based Occupancy Estimation
Global-Aware Edge Prioritization for Pose Graph Initialization
Optical Flow Matching: Reframing Optical Flow as Continuous Transport Dynamics
U^2Flow: Uncertainty-Aware Unsupervised Optical Flow Estimation
Confusion-Aware Spectral Regularizer for Long-Tailed Recognition
Learning Latent Concepts for Detecting Out-of-Distribution Objects
Learning Like Humans: Analogical Concept Learning for Generalized Category Discovery
Understanding Task Transfer in Vision-Language Models
ImmerIris: A Large-Scale Dataset and Benchmark for Off-Axis and Unconstrained Iris Recognition in Immersive Applications
OpenDance: Multimodal Controllable 3D Dance Generation with Large-scale Internet Data
POLAR: A Portrait OLAT Dataset and Generative Framework for Illumination-Aware Face Modeling
Scaling View Synthesis Transformers
WildPose: A Unified Framework for Robust Pose Estimation in the Wild
MoRe: Motion-aware Feed-forward 4D Reconstruction Transformer
Motion 3-to-4: 3D Motion Reconstruction for 4D Synthesis
KV-Tracker: Real-Time Pose Tracking with Transformers
DualPrim: Compact 3D Reconstruction with Positive and Negative Primitives
Agentic Retoucher for Text-To-Image Generation
Paper2Figure: A Multi-Agent Collaborative System for Figure Generation Towards Academic Research Paper
Rethinking Glyph Spatial Information in Font Generation
ChArtist: Generating Pictorial Charts with Unified Spatial and Subject Control
Aligning Multi-Character Narrative Image Generation with Multi-Aspect Human Preferences
FoleyDirector: Directing Temporal Controllable Video-to-Audio Generation via Fine-Grained Temporal Scripts
DCoAR: Deep Concept Injection into Unified Autoregressive Models for Personalized Text-to-Image Generation
AutoDebias: An Automated Framework for Detecting and Mitigating Backdoor Biases in Text-to-Image Models
PosterIQ: A Design Perspective Benchmark for Poster Understanding and Generation
IVAAN: Instance-level Vision-Language Alignment via Attribute-Guided Text Prompts Generation for Nuclei Analysis
Boosting Visual Reprogramming for CLIP with Dual Granularity Alignment
Decouple to Generalize: Context-First Self-Evolving Learning for Data-Scarce Vision-Language Reasoning
UniGen-1.5: Enhancing Image Generation and Editing through Reward Unification in RL
Label What Matters: Modality-Balanced and Difficulty-Aware Multimodal Active Learning
Unified Personalized Understanding, Generating and Editing
Decision Boundary-aware Generation for Long-tailed Learning
Towards Stable Federated Continual Test-Time Adaptation in Wild World
ACE-Merging: Data-Free Model Merging with Adaptive Covariance Estimation
Addressing Exacerbated Attention Sink for Source-Free Cross-Domain Few-Shot Learning
Depth Hypothesis Guided Iterative Refinement for Event–Image Monocular Depth Estimation
Tracking through Severe Occlusion via Event-Derived Transient Cues
FastEventDGS: Deformable Gaussian Splatting for Fast Dynamic Scenes from a Single Event Camera
From Corners to Fiducial Tags: Revisiting Checkerboard Calibration for Event Cameras
Extending Embodied Question Answering from Perception to Decision
MaskDexGrasp: Generative Masked Modeling for Part-Aware Dexterous Grasp Synthesis
Predict Before You Explore: Predictive Planning with Specialized Memory for Embodied Question Answering
VideoWeaver: Multimodal Multi-View Video-to-Video Transfer for Embodied Agents
FlowSteer: Guiding Few-Step Image Synthesis with Authentic Trajectories
Rethinking Intermediate Representation for VLM-based Robot Manipulation
MICo-150K: A Comprehensive Dataset Advancing Multi-Image Composition
Harmonic Canvas: Inversion-Free Editing for Visually-Guided Music Style Transfer
SCIEval: Evaluating and Benchmarking the Faithfulness of Scientific Image Generation and Interpretation with Large Multimodal Models
Catalyst4D: High-Fidelity 3D-to-4D Scene Editing via Dynamic Propagation
ReFlow: Self-correction Motion Learning for Dynamic Scene Reconstruction
NVGS: Neural Visibility for Occlusion Culling in 3D Gaussian Splatting
NeAR: Coupled Neural Asset–Renderer Stack
Life-IQA: Boosting Blind Image Quality Assessment through GCN-enhanced Layer Interaction and MoE-based Feature Decoupling
Multinex: Lightweight Low-light Image Enhancement via Multi-prior Retinex
Beyond Ground-Truth: Leveraging Image Quality Priors for Real-World Image Restoration
ExpoCM: Exposure-Aware One-Step Generative Single-Image HDR Reconstruction
UCMNet: Uncertainty-Aware Context Memory Network for Under-Display Camera Image Restoration
Bilevel Layer-Positioning LoRA for Real Image Dehazing
SD-FSMIS: Adapting Stable Diffusion for Few-Shot Medical Image Segmentation
SHAPE: Structure-aware Hierarchical Unsupervised Domain Adaptation with Plausibility Evaluation for Medical Image Segmentation
Delving Aleatoric Uncertainty in Medical Image Segmentation via Vision Foundation Models
Focus on Background: Exploring SAM's Potential in Few-shot Medical Image Segmentation with Background-centric Prompting
Multi-Paradigm Collaborative Adversarial Attack Against Multi-Modal Large Language Models
TINA: Text-Free Inversion Attack for Unlearned Text-to-Image Diffusion Models
Jailbreaking Vision-Language Models via Dissonance-Guided Suffix Optimization and Image–Phrase Injection
Transform to Transfer: Boosting Adversarial Attack Transferability on Vision-Language Pre-training Models
Reliable Clustering Number Estimation for Contrastive Multi-View Clustering
Enhance-then-Balance Modality Collaboration for Robust Multimodal Sentiment Analysis
MoDES: Accelerating Mixture-of-Experts Multimodal Large Language Models via Dynamic Expert Skipping
EXOTIC: External Vision-driven Incomplete Multi-view Classification
OneCAT: Decoder-Only Auto-Regressive Model for Unified Understanding and Generation
UniT: Unified Multimodal Chain-of-Thought Test-time Scaling
Multi-modal Test-time Adaptation via Adaptive Probabilistic Gaussian Calibration
Information-Theoretic Decomposition for Multimodal Interaction Learning
MUST: Modality-Specific Representation-Aware Transformer for Diffusion-Enhanced Survival Prediction with Missing Modality
Visual Autoregressive Modeling via Next Focus Prediction
TempoMaster: Efficient Long Video Generation via Next-Frame-Rate Prediction
SynMotion: Semantic-Visual Adaptation for Motion Customized Video Generation
Mixture of Style Experts for Diverse Image Stylization
Mirai: Autoregressive Visual Generation Needs Foresight
Bridging the Perception Gap in Image Super-Resolution Evaluation
Restore Text First, Enhance Image Later: Two-Stage Scene Text Image Super-Resolution with Glyph Structure Guidance
Physics-Consistent Diffusion for Efficient Fluid Super-Resolution via Multiscale Residual Correction
Omni-Supervised Motion Editing: Balancing Change and Invariance through Positive-Negative Learning
FaceCam: Portrait Video Camera Control via Scale-Aware Conditioning
Cross-Axis Feature Fusion with Joint-Wise Motion Difference Prediction for Text-Based 3D Human Motion Editing
Towards Decompositional Human Motion Generation with Energy-Based Diffusion Models
PAMotion: Physics-Aware Motion Generation for Full-Body Interaction with Multiple Objects
CLEP: Contrastive Language-Pose Pretraining
ARMFlow: AutoRegressive MeanFlow for Online 3D Human Reaction Generation
Beyond Mimicry: Learning Whole-Body Human-Humanoid Interaction from Human-Human Demonstrations
PHAC: Promptable Human Amodal Completion
IntrinsicWeather: Controllable Weather Editing in Intrinsic Space
Outlier-Robust Diffusion Solvers for Inverse Problems
Beyond Fixed Formulas: Data-Driven Linear Predictor for Efficient Diffusion Models
ReasonX: MLLM-Guided Intrinsic Image Decomposition
KLIP: Localized Distribution Shift Detection via KL-Divergence with Diffusion Priors in Inverse Problems
Taming Generative Diffusion Model for Task-Oriented Infrared Imaging
RxnCaption: Reformulating Reaction Diagram Parsing as Visual Prompt Guided Captioning
DiGraphHal-Bench: Evaluating Multimodal Large Language Models on Complex Directed Graphs
SEA-Vision: A Multilingual Benchmark for Comprehensive Document and Scene Text Understanding in Southeast Asia
E-comIQ-ZH: A Human-Aligned Dataset and Benchmark for Fine-Grained Evaluation of E-commerce Posters with Chain-of-Thought
GeoWorld: Geometric World Models
MonoVLM: Monocular 3D Visual Grounding with Vision Language Models
Curvature-Aware Captioning: Leveraging Geodesic Attention for 3D Scene Understanding
SPREAD: Spatial-Physical REasoning via geometry Aware Diffusion
Merge3D: Efficient 3D Multimodal LLMs via Joint 2D-3D Token Merging
Quota-Calibrated Fine-Grained Alignment with Context-Aware Marginals for Text-based Person Retrieval
Evo-Retriever: LLM-Guided Curriculum Evolution with Viewpoint-Pathway Collaboration for Multimodal Document Retrieval
Taxonomy-Aware Representation Alignment for Hierarchical Visual Recognition with Large Multimodal Models
Beyond Semantic Search: Towards Referential Anchoring in Composed Image Retrieval
MarkushGrapher-2: End-to-end Multimodal Recognition of Chemical Structures
Progressive Cross-Modal Causal Intervention for Long-Term Action Recognition
EthoCLIP: Ontology-Enhanced Video-Language Pretraining for Animal Behavior Understanding
VidPrism: Heterogeneous Mixture of Experts for Image-to-Video Transfer
Learning Transferable Temporal Primitives for Video Reasoning via Synthetic Videos
See Through the Noise: Improving Domain Generalization in Gaze Estimation
From Feature Learning to Spectral Basis Learning: A Unifying and Flexible Framework for Efficient and Robust Shape Matching
Topology-aware Feature Propagation for Unsupervised Non-rigid Point Cloud Correspondence
SAG-GNN: Semantic-Aware Guided GNN for Descriptor-Free 2D-3D Matching
LiREC-Net: A Target-Free and Learning-Based Network for LiDAR, RGB, and Event Calibration
GM-R^2: Generative Matching Learning for Unsupervised Geometric Representation and Registration
MORE-STEM: Long-Short MemOry REcall and Spatio-TEmporal Consistency Model for Query-Driven 3D/4D Point Cloud Segmentation
Low-Rank Test-Time Training for Pre-Trained Point Cloud Models
DeAR: Fine-Grained VLM Adaptation by Decomposing Attention Head Roles
SynCLIP: Synonym-Coherent Language-Image Pretraining for Robust Open-Vocabulary Dense Perception
VisMem: Latent Vision Memory Unlocks Potential of Vision-Language Models
ORION: ORthonormal Text Encoding for Universal VLM AdaptatION
Mirror Illusion Art
Towards Human-Like Robot Handwriting via Contour-Aware Generation
MajutsuCity: Language-driven Aesthetic-adaptive City Generation with Controllable 3D Assets and Layouts
VectorArk: Learning Practical Image Vectorization with Rounded Polygon Representation
LottieGPT: Tokenizing Vector Animation for Autoregressive Generation
Rank-Guided Pseudo-Bias Learning for Robust Black-Box Adaptation
WaTeRFlow: Watermark Temporal Robustness via Flow Consistency
SineProject: Machine Unlearning for Stable Vision-Language Alignment
HiLoRA: Hierarchical Low-Rank Adaptation for Personalized Federated Learning
Guiding Diffusion Models with Fine-Grained Conditions and Semantics-Preserving Sampling for One-Shot Federated Learning
FedRAC: Rolling Submodel Allocation for Collaborative Fairness in Federated Learning
Understanding Temporal Logic Consistency in Video-Language Models through Cross-Modal Attention Discriminability
Small Object, Great Challenge: A Benchmark for Small Object Visual Grounding
ReMoRa: Multimodal Large Language Model based on Refined Motion Representation for Long-Video Understanding
HERO: Hierarchical Embedding-Refinement for Open-Vocabulary Temporal Sentence Grounding in Videos
Hybrid Token Compression for Vision-Language Models
When Token Pruning is Worse than Random: Understanding Visual Token Information in VLLMs
VISion On Request: Enhanced VLLM efficiency with sparse, dynamically selected, vision-language interactions
Heterogeneous Decentralized Diffusion Models
TAR: Token-Aware Refinement for Fine-grained Generalized Category Discovery
Semantic Noise Reduction via Teacher-Guided Dual-Path Audio-Visual Representation Learning
Bypassing the Transport Plan: Dynamic Reweighting for Out-of-Distribution Detection with Optimal Transport
Debiased Sample Selection for Learning with Noisy Labels
Open-Ended Instruction Realization with LLM-Enabled Multi-Planner Scheduling in Autonomous Vehicles
Sensor2Sensor: Cross-Embodiment Sensor Conversion for Autonomous Driving
DriveCombo: Benchmarking Compositional Traffic Rule Reasoning in Autonomous Driving
WhisperNet: A Scalable Solution for Bandwidth-Efficient Collaboration
Generalizable Co-Salient Object Detection via Mixed Content-Style Modulation
Mixture-of-Experts based Feature Decoupling for Open Vocabulary Scene Graph Generation
Denoise and Align: Towards Source-Free UDA for Robust Panoramic Semantic Segmentation
SPOT: Spatiotemporal Prompt Optimization for Motion-Stabilized MLLM-Guided Video Segmentation
Changes in Real Time: Online Scene Change Detection with Multi-View Fusion
DGS: Dual Gradient and Semantic-Shift Guided Low-Rank Adaptation for Class Incremental Learning
Dynamic Magic: Unleashing Restricted Knowledge for Lifelong Person Re-Identification
Temporal Imbalance of Positive and Negative Supervision in Class-Incremental Learning
Dance Across Shifts: Forward-Facilitation Continual Test-Time Adaptation through Dynamic Style Bridging
High-Fidelity Mobile Avatars with Pruned Local Blendshapes
PhysSkin: Real-Time and Generalizable Physics-Based Animation via Self-Supervised Neural Skinning
Bridging Privacy and Provenance: Traceable Virtual Identity Generation
Dynamic Label Noise Suppression with Optimal Teacher Pool for Facial Expression Recognition
DecoVLN: Decoupling Observation, Reasoning, and Correction for Vision-and-Language Navigation
NavForesee: A Unified Vision-Language World Model for Hierarchical Planning and Dual-Horizon Navigation Prediction
LookasideVLN: Direction-Aware Aerial Vision-and-Language Navigation
FreeForm: Reduced-Order Deformable Simulation from Particle-Based Skinning Eigenmodes
4DEquine: Disentangling Motion and Appearance for 4D Equine Reconstruction from Monocular Video
ZINA: Multimodal Fine-grained Hallucination Detection and Editing
PhysHO: Physics-Based Dynamic 3D Gaussian Human and Object from Monocular Video
Mitigating Multimodal Hallucinations via Gradient-based Self-Reflection
KVSmooth: Mitigating Hallucination in Multi-modal Large Language Models through Key-Value Smoothing
Tell Model Where to Look: Mitigating Hallucinations in MLLMs by Vision-Guided Attention
ExPose: Reinforcing Video Generation Models for Extreme Pose Estimation
SounDiT: Geo-Contextual Soundscape-to-Landscape Generation
CamDirector: Towards Long-Term Coherent Video Trajectory Editing
Decoupling Bias, Aligning Distributions: Synergistic Fairness Optimization for Deepfake Detection
BDNet: Bio-Inspired Dual-Backbone Small Object Detection Network
RGB-Event based Pedestrian Attribute Recognition: A Benchmark Dataset and An Asymmetric RWKV Fusion Framework
URICA: A Uniformity Region Affine Identifier Capture Algorithm for Arbitrary Region Retrieval in Pathology Images
DetAny4D: Detect Anything 4D Temporally in a Streaming RGB Video
Video-CoE: Reinforcing Video Event Prediction via Chain of Events
VideoAuto-R1: Video Auto Reasoning via Thinking Once, Answering Twice
Question-guided Visual Compression with Memory Feedback for Long-Term Video Understanding
CURVE: A Benchmark for Cultural and Multilingual Long Video Reasoning
Hierarchical Long Video Understanding with Audiovisual Entity Cohesion and Agentic Search
LifeEval: A Multimodal Benchmark for Assistive AI in Egocentric Daily Life Tasks
Attention Surgery: An Efficient Recipe to Linearize Your Video Diffusion Transformer
YOSE: You Only Select Essential Tokens for Efficient DiT-based Video Object Removal
FG-Portrait: 3D Flow Guided Editable Portrait Animation
IP-Adapter Is All You Need: Towards Fine-Tuning-Free Diffusion-Based Talking Face Generation
InnoAds-Composer: Efficient Condition Composition for E-Commerce Poster Generation
DSERT-RoLL: Robust Multi-Modal Perception for Diverse Driving Conditions with Stereo Event-RGB-Thermal Cameras, 4D Radar, and Dual-LiDAR
ReManNet: A Riemannian Manifold Network for Monocular 3D Lane Detection
STUR3D: Spatio-Temporal Unified Representation Learning for 3D Object Detection
Exploring 6D Object Pose Estimation with Deformation
PhysInOne: Visual Physics Learning and Reasoning in One Suite
AviaSafe: A Physics-Informed Data-Driven Model for Aviation Safety–Critical Cloud Forecasts
Reading or Reasoning? Format Decoupled Reinforcement Learning for Document OCR
CLIP Is Shortsighted: Paying Attention Beyond the First Sentence
TRivia: Self-supervised Fine-tuning of Vision-Language Models for Table Recognition
Urban-GS: A Unified 3D Gaussian Splatting Framework for Compact and High-Fidelity Aerial-to-Street Reconstruction
RemedyGS: Defend 3D Gaussian Splatting Against Computation Cost Attacks
SparseCam4D: Spatio-Temporally Consistent 4D Reconstruction from Sparse Cameras
GS^2: Graph-based Spatial Distribution Optimization for Compact 3D Gaussian Splatting
OnlinePG: Online Open-Vocabulary Panoptic Mapping with 3D Gaussian Splatting
MLLMSplat: A 2D MLLM-Powered Framework for 3D Gaussian Splatting Understanding, Generation, and Editing
Dropping Anchor and Spherical Harmonics for Sparse-view Gaussian Splatting
RAP: Fast Feedforward Rendering-Free Attribute-Guided Primitive Importance Score Prediction for Efficient 3D Gaussian Splatting Processing
PointGS: Semantic-Consistent Unsupervised 3D Point Cloud Segmentation with 3D Gaussian Splatting
Scene Grounding in the Wild
Revisiting 3D Reconstruction Kernels as Low-Pass Filters
VIRO: Robust and Efficient Neuro-Symbolic Reasoning with Verification for Referring Expression Comprehension
Fuel Gauge: Estimating Chain-of-Thought Length Ahead of Time in Large Multimodal Models
Thinking in Dynamics: How Multimodal Large Language Models Perceive, Track, and Reason Dynamics in Physical 4D World
AV-Reasoner: Improving and Benchmarking Clue-Grounded Audio-Visual Counting for MLLMs
See Less, See Right: Bi-directional Perceptual Shaping For Multimodal Reasoning
Machine Mental Imagery: Empower Multimodal Reasoning with Latent Visual Tokens
CodePercept: Code-Grounded Visual STEM Perception for MLLMs
TableMix: Enhancing Multimodal Table Reasoning in MLLMs from a Data-Centric Perspective
Grounded Chain-of-Thought for Multimodal Large Language Models
SegMo: Co-Designing Content-Aware Sparsity and Locally-Cohesive Segment Parallelism for Efficient VLM Inference
Compressed-Domain-Aware Online Video Super-Resolution
Is Bin Generation Indispensable? A Bin-Generation-Free Dataset Quantization via Semantic Perspective
Distributed Image Compression with Multimodal Side Information at Extremely Low Bitrates
Enhancing Video Vision Language Model with Hippocampal Sensing
VIRD: View-Invariant Representation through Dual-Axis Transformation for Cross-View Pose Estimation
SoPE: Spherical Coordinate-Based Positional Embedding for Enhancing Spatial Perception of 3D LVLMs
EfficientVPR: Toward Efficient Visual Place Recognition via Scene-Aware Prompt Tuning and Adaptive Feature Enhancement
Universal Guideline-Driven Image Clustering via a Hybrid LLM Agent
VideoChat-M1: Collaborative Policy Planning for Video Understanding via Multi-Agent Reinforcement Learning
Think, Then Verify: A Hypothesis–Verification Multi-Agent Framework for Long Video Understanding
Reinforce to Learn, Elect to Reason: A Dual Paradigm for Video Reasoning
LongVT: Incentivizing "Thinking with Long Videos" via Native Tool Calling
ReCoFuse: Ultra-Robust Image Fusion via Restorative Multi-Modal Diffusion Reciprocal Coupling
Degradation-Robust Fusion: An Efficient Degradation-Aware Diffusion Framework for Multimodal Image Fusion in Arbitrary Degradation Scenarios
DF^2-VB: Dual-level Fuzzy Fusion with View-specific Boosting for Multi-view Multi-label Classification
Self-guided Semantic Inspection for Zero-Shot Composed Image Retrieval
G-MIXER: Geodesic Mixup-based Implicit Semantic Expansion and Explicit Semantic Re-ranking for Zero-Shot Composed Image Retrieval
MUSE: Harnessing Precise and Diverse Semantics for Few-Shot Whole Slide Image Classification
Graph Attention Prototypical Network for Robust Few-Shot Classification
EVLF: Early Vision-Language Fusion for Generative Dataset Distillation
Flow Map Distillation Without Data
A Bit is All You Need! Efficient Video Capture via Single Bit Imaging
Physics-Guided Multistep Deformation Reversal for Ancient Bamboo Slip Restoration
SGDE: Self-supervised Geometry Degradation Estimation Framework for Coded Aperture Compressive Spectral Imaging
Depth Any Endoscopy: Towards Self-Supervised Generalizable Depth Estimation in Monocular Endoscopy
AdaSFormer: Adaptive Serialized Transformers for Monocular Semantic Scene Completion from Indoor Environments
Dark3R: Learning Structure from Motion in the Dark
What Makes Good Synthetic Training Data for Zero-Shot Stereo Matching?
Iris: Integrating Language into Diffusion-based Monocular Depth Estimation
Ov3R: Open-Vocabulary Semantic 3D Reconstruction from RGB Videos
UniPart: Part-Level 3D Generation with Unified 3D Geom–Seg Latents
Mesh-Pro: Asynchronous Advantage-guided Ranking Preference Optimization for Artist-style Quadrilateral Mesh Generation
Order Matters: 3D Shape Generation from Sequential VR Sketches
Think-Then-Generate: Structural Chain-of-Thought Reasoning for Consistent 3D Generation
FabricGen: Microstructure-Aware Woven Fabric Generation
Leveraging Verifier-Based Reinforcement Learning in Image Editing
VIVA: VLM-Guided Instruction-Based Video Editing with Reward Optimization
Reward Forcing: Efficient Streaming Video Generation with Rewarded Distribution Matching Distillation
C^2FG: Control Classifier-Free Guidance via Score Discrepancy Analysis
Unified Customized Generation by Disentangled Reward Modeling
Region-Aware Instance Consistency Learning for Micro-Expression Recognition
LaDy: Lagrangian-Dynamic Informed Network for Skeleton-based Action Segmentation via Spatial-Temporal Modulation
Unlocking Pre-trained Weights: Parameter Inheritance for Zero-Shot Initialization
Progressive Neural Architecture Generation
When Do Models Actually Decide? Mapping the Layer-Wise Decision Timeline in Pretrained Neural Networks
Consensus vs. Controversy: Mapping the Decision Space Where Architectures Diverge
ProSoftArena: Benchmarking Hierarchical Capabilities of Multi-modal Agents in Professional Software Environments
DRS-GUI: Dynamic Region Search for Training-Free GUI Grounding
Geometry-driven OOD Detectors Are Class-Incremental Learners
Prompt-Free Unknown Label Generation for Open World Detection in Remote Sensing
Learning to Diversify and Focus: A Reinforcement Framework for Open-Vocabulary HOI Detection
DINO Eats CLIP: Adapting Beyond Knowns for Open-set 3D Object Retrieval
Leveraging Class Distributions in CLIP for Weakly Supervised Semantic Segmentation
D2Dewarp: Dual Dimensions Geometric Representation Learning Based Document Image Dewarping
Fast Reasoning Segmentation for Images and Videos
FireScope: Wildfire Risk Raster Prediction With a Chain-of-Thought Oracle
TESSERA: Temporal Embeddings of Surface Spectra for Earth Representation and Analysis
Regulating Rather than Constraining: Adaptive Guidance for Complex Spectral Reconstruction in Pansharpening
GeoMMBench and GeoMMAgent: Toward Expert-Level Multimodal Intelligence in Geoscience and Remote Sensing
Revisiting the Necessity of Full Accuracy: Weakly Supervised Object-Level Offset Correction for Misaligned Building Labels
ZoomEarth: Active Perception for Ultra-High-Resolution Geospatial Vision-Language Tasks
Taming the Long Tail: Rebalancing Adversarial Training via Adaptive Perturbation
Logit-Margin Repulsion for Backdoor Defense
Towards Stealthy and Effective Backdoor Attacks on Lane Detection: A Naturalistic Data Poisoning Approach
Red-teaming Retrieval-Augmented Diffusion Models via Poisoning Knowledge Bases
EgoRoC: Towards Egocentric Robotic Control via Task-Agnostic Visual Alignment
Describe Anything Anywhere At Any Moment
VLA Models Are More Generalizable Than You Think: Revisiting Physical and Spatial Modeling
Action–Geometry Prediction with 3D Geometric Prior for Bimanual Manipulation
HQC-NBV: A Hybrid Quantum-Classical View Planning Approach
MM-ACT: Learn from Multimodal Parallel Generation to Act
SE(3)-Equivariance with Geometric and Topological Guidance for Category-Level Object Pose Estimation
Global Prior Meets Local Consistency: Dual-Memory Augmented Vision-Language-Action Model for Efficient Robotic Manipulation
Progressive Multi-cue Alignment for Unaligned RGBT Tracking
Adapting Lightweight Image-based Counting Models for Video Crowd Counting
Sparse Task Vector Mixup with Hypernetworks for Efficient Knowledge Transfer in Whole-Slide Image Prognosis
GenTract: Generative Global Tractography
LUMINA: A Multi-Vendor Mammography Benchmark with Energy Harmonization Protocol
Can Natural Image Autoencoders Compactly Tokenize fMRI Volumes for Long-Range Dynamics Modeling?
IEBGL: An Interpretability-Enhanced Brain Graph Learning Framework with LLM-Instructed Topology and Literature-Augmented Semantics
F^2-Assist: Multi-Phase Fetal Growth Forecast and Report Generation from Ultrasound Examination
Structural–Semantic Perception for Diffusion-Guided Temporal Forgery Localization
IncreFA: Breaking the Static Wall of Generative Model Attribution
AVFakeBench: A Comprehensive Audio-Video Forgery Detection Benchmark for AV-LMMs
Detecting Compressed AI-Generated Images via Phase Spectrum Robustness
DeepfakeImpact: A Two-Stage Benchmark with Real-World Impact in Deepfake Detection
Enhancing the Security of Visual Speaker Authentication Based on Dynamic Lip-Print Analysis
Editprint: General Digital Image Forensics via Editing Fingerprint with Self-Augmentation Training
Goldilocks Test Sets for Face Verification
DLVP-CLIP: Enhancing Fine-Grained Zero-Shot Anomaly Detection via Dynamic Local Visual Prompting
MoECLIP: Patch-Specialized Experts for Zero-shot Anomaly Detection
AnomalyVFM -- Transforming Vision Foundation Models into Zero-Shot Anomaly Detectors
TLMA: Mitigating the Impact of Weakly Labeled Information for Video Anomaly Detection
Anomaly-Related Residual Fields for Cross-domain Anomaly Detection
No Need For Real Anomaly: MLLM Empowered Zero-Shot Video Anomaly Detection
FB-CLIP: Fine-Grained Zero-Shot Anomaly Detection with Foreground-Background Disentanglement
DynamicVGGT: Learning Dynamic Point Maps for 4D Scene Reconstruction in Autonomous Driving
Test-Time 3D Occupancy Prediction
Group Diffusion: Enhancing Image Generation by Unlocking Cross-Sample Collaboration
RegionRoute: Regional Style Transfer with Diffusion Model
dMLLM-TTS: Self-Verified and Efficient Test-Time Scaling for Diffusion Multi-Modal Large Language Models
Low-Rank Residual Diffusion Models
TC-Padé: Trajectory-Consistent Padé Approximation for Diffusion Acceleration
Guiding Token-Sparse Diffusion Models
Accelerating Diffusion-based Video Editing via Heterogeneous Caching: Beyond Full Computing at Sampled Denoising Timestep
High-Fidelity Virtual Try-On beyond Paired Data Scarcity via Diffusion-based Cycle-Consistent Learning
See and Fix the Flaws: Enabling VLMs and Diffusion Models to Comprehend Visual Artifacts via Agentic Data Synthesis
Scale Space Diffusion
Making Training-Free Diffusion Segmentors Scale with the Generative Power
Few-Step Diffusion Sampling Through Instance-Aware Discretizations
SpeeDiff: Scalable Pixel-Anchored End-to-End Latent Diffusion Model
Structure-to-Intensity Diffusion for Adverse-Weather LiDAR Generation
Focal–General Diffusion Model with Semantic Consistent Guidance for Sign Language Production
PixelRush: Ultra-Fast, Training-Free High-Resolution Image Generation via One-step Diffusion
Decoupled Residual Denoising Diffusion Models for Unified and Data Efficient Image-to-Image Translation
MotionV2V: Editing Motion in a Video
Mind the Generative Details: Direct Localized Detail Preference Optimization for Video Diffusion Models
DreamStyle: A Unified Framework for Video Stylization
Cross-modal Representation Learning for Diffusion-generated Image Detection
CaricHarmony: Contrastive Diffusion Paths for Identity-Preserving Caricature Synthesis
RAPID: Reusing Attention Sparsity with Inter-step Adaptation for Efficient Video Diffusion
Learning Convex Decomposition via Feature Fields
Mapping Networks
SDTrack: A Baseline for Event-based Tracking via Spiking Neural Networks
DK-DDIL: Adaptive Knowledge Retention for Dynamic Domain-Incremental Learning in Medical Imaging
Dual-level Adapter Boosting Prompt-free Curvilinear Structure Segmentation
LATA: Laplacian-Assisted Transductive Adaptation for Conformal Uncertainty in Medical VLMs
SegMoTE: Token-Level Mixture of Experts for Medical Image Segmentation
FedAdamom: Adaptive Momentum for Improved Generalization in Federated Optimization
SimScale: Learning to Drive via Real-World Simulation at Scale
Texvent: Asynchronous Event Data Simulation via Text Prompt
Free-Grained Hierarchical Visual Recognition
LiteVGGT: Boosting Vanilla VGGT via Geometry-aware Cached Token Merging
LASER: Layer-wise Scale Alignment for Training-Free Streaming 4D Reconstruction
DROID-SLAM in the Wild
HeSS: Head Sensitivity Score for Sparsity Redistribution in VGGT
Neu-PiG: Neural Preconditioned Grids for Fast Dynamic Surface Reconstruction on Long Sequences
PointTPA: Dynamic Network Parameter Adaptation for 3D Scene Understanding
Compositional Text-to-Image Generation Via Region-aware Bimodal Direct Preference Optimization
Extending One-Step Image Generation from Class Labels to Text via Discriminative Text Representation
Hear What Matters! Text-conditioned Selective Video-to-Audio Generation
DiffGraph: An Automated Agent-driven Model Merging Framework for In-the-Wild Text-to-Image Generation
Gloria: Consistent Character Video Generation via Content Anchors
M4V: Multimodal Mamba for Efficient Text-to-Video Generation
Mixture of States: Routing Token-Level Dynamics for Multimodal Generation
HiCoGen: Hierarchical Compositional Text-to-Image Generation in Diffusion Models via Reinforcement Learning
TherA: Thermal-Aware Visual-Language Prompting for Controllable RGB-to-Thermal Infrared Translation
CoV-Align: Efficient Fine-grained Cross-Modal Alignment with Cohesive Visual Semantics Priority
TDATR: Improving End-to-End Table Recognition via Table Detail-Aware Learning and Cell-Level Visual Alignment
PhyCritic: Multimodal Critic Models for Physical AI
Why Not Hyperparameter-Friendly Optimisation? A Monotonic Adaptive Norm Rescaling Approach For Long-Tailed Recognition
MemFlow: A Lightweight Forward Memorizing Framework for Quick Domain Adaptive Feature Mapping
Vision-Language Model Guided Source-Free Domain Adaptation via Optimal Transport
TaskIT: Memory-Efficient Fine-Tuning of Multi-LoRA LLMs via Cross-Task Importance Transfer
ARES: Unifying Asymmetric RGB-Event Stereo for Probabilistic Scene Flow Estimation
Moving Border Ownership for Event-based Motion Segmentation
TTAPFormer: Robust Arbitrary Point Tracking via Transient Asynchronous Fusion of Frames and Events
Seeing Motion Through Polarity for Event-based Action Recognition
SpaceTools: Tool-Augmented Spatial Reasoning via Double Interactive RL
TeamHOI: Learning a Unified Policy for Cooperative Human-Object Interactions with Any Team Size
Experience Transfer for Multimodal LLM Agents in Minecraft Game
MSGNav: Unleashing the Power of Multi-modal 3D Scene Graph for Zero-Shot Embodied Navigation
SaPaVe: Towards Active Perception and Manipulation in Vision-Language Action Models for Robotics
ForeAct: Steering Your VLA with Efficient Visual Foresight Planning
MERIT: Multi-domain Efficient RAW Image Translation
Probabilistic Prompt Adaptation for Unified Image Aesthetics and Quality Assessment
Inter-Edit: First Benchmark for Interactive Instruction-Based Image Editing
PR-IQA: Partial-Reference Image Quality Assessment for Diffusion-Based Novel View Synthesis
MVInverse: Feed-forward Multiview Inverse Rendering in Seconds
MoRel: Long-Range Flicker-Free 4D Motion Modeling via Anchor Relay-based Bidirectional Blending with Hierarchical Densification
Multi-view Pyramid Transformer: Look Coarser to See Broader
RL‑ScanIQA: Reinforcement-Learned Scanpaths for Blind 360° Image Quality Assessment
Benchmarking Endoscopic Surgical Image Restoration and Beyond
UniLDiff: Unlocking the Power of Diffusion Priors for All-in-One Image Restoration
MatAnyone 2: Scaling Video Matting via a Learned Quality Evaluator
Learning Where to Look and How to Judge: Resolution-agnostic Image Quality Assessment with Quality-aware Saliency
Bridging RGB and Hematoxylin Components: An Interleaved Guidance and Fusion Framework for Point Supervised Nuclei Segmentation
VoxTell: Free-Text Promptable Universal 3D Medical Image Segmentation
Photo-Guided Tooth Segmentation on 3D Oral Scan Model
Sketch2CT: Multimodal Diffusion for Structure-Aware 3D Medical Volume Generation
SafeLogo: Turning Your Logos into Jailbreak Shields via Micro-Regional Adversarial Training
Beyond Text Prompts: Precise Concept Erasure through Text–Image Collaboration
ReMoE: Region-Mixture Experts for Adversarially-Robust Vision Transformers
TreeTeaming: Autonomous Red-Teaming of Vision-Language Models via Hierarchical Strategy Exploration
Chain-of-Thought Guided Multi-Modal Object Re-Identification
Parameter-Efficient Adaptation for MLLMs via Implicit Modality Decomposition
Hyperbolic Gramian Volumes for Multimodal Alignment
Venus: Benchmarking and Empowering Multimodal Large Language Models for Aesthetic Guidance and Cropping
CaReFlow: Cyclic Adaptive Rectified Flow for Multimodal Fusion
CoRiM: Conflict-driven Risk Minimization for Dynamic Multimodal Fusion
Uncertainty-Aware Exploratory Direct Preference Optimization for Multimodal Large Language Models
SAMTok: Representing Any Mask with Two Words
Cinematic Audio Source Separation Using Visual Cues
MMSD3.0: A Multi-Image Benchmark for Real-World Multimodal Sarcasm Detection
Anchor-Guided Gradient Alignment for Incomplete Multimodal Learning
Kontinuous Kontext: Continuous Strength Control for Instruction-based Image Editing
Progressive Supernet Training for Efficient Visual Autoregressive Modeling
Test-Time Instance-Specific Parameter Composition: A New Paradigm for Adaptive Generative Modeling
Meta-CoT: Enhancing Granularity and Generalization in Image Editing
Dual-Granularity Memory for Efficient Video Generation
MU-GeNeRF: Multi-view Uncertainty-guided Generalizable Neural Radiance Fields for Distractor-aware Scene
Object-WIPER: Training-Free Object and Associated Effect Removal in Videos
Chain of Event-Centric Causal Thought for Physically Plausible Video Generation
FlowMotion: Training-Free Flow Guidance for Video Motion Transfer
DNF-SR: Dual-Input and Negative-Aware Feature Fine-Tuning for Real-World Image Super-Resolution
Edge-Focused Super-Resolution for Omnidirectional Images with Spherical Geometric Augmentation
Disentangled Textual Priors for Diffusion-based Image Super-Resolution
Rethinking Diffusion Model-Based Video Super-Resolution: Leveraging Dense Guidance from Aligned Features
FiDeSR: High-Fidelity and Detail-Preserving One-Step Diffusion Super-Resolution
Towards Highly-Constrained Human Motion Generation with Retrieval-Guided Diffusion Noise Optimization
Human Geometry Distribution for 3D Animation Generation
Towards Storytelling Animations: Joint Synthesis of Human and Camera Motions
MoLingo: Motion–Language Alignment for Text-to-Human Motion Generation
Toward Early Quality Assessment of Text-to-Image Diffusion Models
CoD: A Diffusion Foundation Model for Image Compression
SpiralDiff: Spiral Diffusion with LoRA for RGB-to-RAW Conversion Across Cameras
Landscape-Awareness for Geometric View Diffusion Model
Otil: Accelerating Diffusion Model Inference via Communication-Efficient Multi-GPU Parallelism
OralGPT-Omni: A Versatile Dental Multimodal Large Language Model
CrossHOI-Bench: A Unified Benchmark for HOI Evaluation across Vision-Language Models and HOI-Specific Methods
Do Vision-Language Models Measure Up? Benchmarking Visual Measurement Reading with MeasureBench
KαLOS finds Consensus: A Meta-Algorithm for Evaluating Inter-Annotator Agreement in Complex Vision Tasks
Scenes as Tokens: Multi-Scale Normal Distributions Transform Tokenizer for General 3D Vision–Language Understanding
LangRef3DGS: Natural Language-Guided 3D Referential Segmentation from Partial Observations via 3D Gaussian Splatting
EgoMind: Activating Spatial Cognition through Linguistic Reasoning in MLLMs
EagleVision: A Dual-Stage Framework with BEV-grounding-based Chain-of-Thought for Spatial Intelligence
Masking Matters: Unlocking the Spatial Reasoning Capabilities of LLMs for 3D Scene-Language Understanding
SpatialStack: Layered Geometry-Language Fusion for 3D VLM Spatial Reasoning
Geometrically-Constrained Agent for Spatial Reasoning
Cross-modal Fuzzy Alignment Network for Text-Aerial Person Retrieval and A Large-scale Benchmark
DiT-Distill: Open-Set Fine-Grained Retrieval via Generative Curriculum Knowledge
Rethinking BCE Loss for Multi-Label Image Recognition with Fine-Tuning
CAST: Context-Aware Dynamic Latent Space Transformation for Interactive Text-to-Image Retrieval
Interactive Episodic Memory with User Feedback
Seeing without Pixels: Perception from Camera Trajectories
StreamRAG: Enhancing Real-Time Video Understanding with Retrieval Augmentation
SkillSight: Efficient First-Person Skill Assessment with Gaze
Video-as-Answer: Predict and Generate Next Video Event with Joint-GRPO
Inside-Out: Measuring Generalization in Vision Transformers Through Inner Workings
Making the Classification Explanation Faithful to the Confidence Score
Intrinsic Concept Extraction Based on Compositional Interpretability
FMPose3D: monocular 3D pose estimation via flow matching
ELiC: Efficient LiDAR Geometry Compression via Cross-Bit-depth Feature Propagation and Bag-of-Encoders
Hg-I2P: Bridging Modalities for Generalizable Image-to-Point-Cloud Registration via Heterogeneous Graphs
Adaptive 3D Perception for Small Aerial Targets Under Sparse Sampling via Reinforcement Learning
StreamVLO: Streaming Visual–LiDAR Odometry with Cumulative Drift Compensation
Towards Calibrating Prompt Tuning of Vision-Language Models
OpenVision 2: A Family of Generative Pretrained Visual Encoders for Multimodal Learning
Language-guided Frequency Modulation for Large Vision-Language Models
DPL: Decoupled Prototype Learning for Enhancing Robustness of Vision–Language Transformers to Missing Modalities
NeuROK: Generative 4D Neural Object Kinematics
BrickNet: Graph-Backed Generative Brick Assembly
CME-CAD: Heterogeneous Collaborative Multi-Expert Reinforcement Learning for CAD Code Generation
The Invisible Gorilla Effect in Out-of-distribution Detection
Interpretable Debiasing of Vision-Language Models for Social Fairness
Scaling Test-Time Robustness of Vision-Language Models via Self-Critical Inference Framework
Machine Unlearning via Adaptive Gradient Reweighting and Multi-stage Objective Optimization
Taming Noise-Induced Prototype Degradation for Privacy-Preserving Personalized Federated Fine-Tuning
FedMOP: Achieving Enhanced Privacy and Performance in Federated Learning via Momentum Orthogonal Projection
Single-Round Scalable Analytic Federated Learning
FedRE: A Representation Entanglement Framework for Model-Heterogeneous Federated Learning
Spatial Matters: Position-Guided 3D Referring Expression Segmentation
TransPrune: Token Transition Pruning for Efficient Large Vision-Language Model
Revisiting Multimodal KV Cache Compression: A Frequency-Domain-Guided Outlier-KV-Aware Approach
HAWK: Head Importance-Aware Visual Token Pruning in Multimodal Models
CORE: Compact Object-centric REpresentations as a New Paradigm for Token Merging in LVLMs
Multi-Hierarchical Contrastive Spectral Fusion for Multi-View Clustering
SECOS: Semantic Capture for Rigorous Classification in Open-World Semi-Supervised Learning
Mitigating Instance Entanglement in Instance-Dependent Partial Label Learning
Large-scale Robust Enhanced Ensemble Clustering via Outlier Decoupling
DriveLaW: Unifying Planning and Video Generation in a Latent Driving World
InvCoSS: Inversion-driven Continual Self-supervised Learning in Medical Multi-modal Image Pre-training
Latent Chain-of-Thought World Modeling for End-to-End Driving
Failure Modes for Deep Learning–Based Online Mapping: How to Measure and Address Them
DriverGaze360: OmniDirectional Driver Attention with Object-Level Guidance
Diffusion Forcing Planner: History-Annealed Planning with Time-Dependent Guidance for Autonomous Driving
EvObj: Learning Evolving Object-centric Representations for 3D Instance Segmentation without Scene Supervision
Robust Promptable Video Object Segmentation
Scene-VLM: Multimodal Video Scene Segmentation via Vision-Language Models
BEV-CAR: Enhancing Monocular Bird’s Eye View Segmentation with Context-Aware Rasterization
Exploring the Underwater World Segmentation without Extra Training
Towards Dynamic Modality Alignment in Multimodal Continual Learning
ϕ-DPO: Fairness Direct Preference Optimization Approach to Continual Learning in Large Multimodal Models
Smart Replay: Adaptive Scheduling of Memory Rehearsal for Computational Resource-Aware Incremental Learning
ReBaPL: Repulsive Bayesian Prompt Learning
ViBES: A Conversational Agent with Behaviorally-Intelligent 3D Virtual Body
DeX-Portrait: Disentangled and Expressive Portrait Animation via Explicit and Latent Motion Representations
IMAIA: Interactive Maps AI Assistant for Travel Planning and Geo-Spatial Intelligence
IDperturb: Enhancing Variation in Synthetic Face Generation via Angular Perturbations
ELITE: Efficient Gaussian Head Avatar from a Monocular Video via Learned Initialization and Test-time Generative Adaptation
MAD: Modality-Adaptive Decoding for Mitigating Cross-Modal Hallucinations in Multimodal Large Language Models
Cross-Modal Attention Calibration for LVLM Hallucination Mitigation
Understanding and Mitigating Hallucinations in Multimodal Chain-of-Thought Models
AniMimic: Imitating 3D Animation from Video Priors
MotionCrafter: Dense Geometry and Motion Reconstruction with a 4D VAE
Taming Video Models for 3D and 4D Generation via Zero-Shot Camera Control
BIT: Matching-based Bi-directional Interaction Transformation Network for Visible-Infrared Person Re-Identification
Vision-Language Attribute Disentanglement and Reinforcement for Lifelong Person Re-Identification
Diversity over Uniformity: Rethinking Representation in Generated Image Detection
Mining Instance-Centric Vision–Language Contexts for Human–Object Interaction Detection
EEGiT: Teaching Vision Transformers to Understand the EEG signal
FedBPrompt: Federated Domain Generalization Person Re-Identification via Body Distribution Aware Visual Prompts
UAV-CB: A Complex-Background RGB–T Dataset and Local Frequency Bridge Network for UAV Detection
TimeViper: A Hybrid Mamba-Transformer Vision-Language Model for Efficient Long Video Understanding
StreamReady: Learning What to Answer and When in Long Streaming Videos
LongVideo-R1: Smart Navigation for Low-cost Long Video Understanding
Ego-Grounding for Personalized Question-Answering in Egocentric Videos
AdaSpark: Adaptive Sparsity for Efficient Long-Video Understanding
VirtueBench: Evaluating Trustworthiness under Uncertainty in Long Video Understanding
RenderFlow: Single-Step Neural Rendering via Flow Matching
ResDiT: Evoking the Intrinsic Resolution Scalability in Diffusion Transformers
Just-in-Time: Training-Free Spatial Acceleration for Diffusion Transformers
ReHyAt: Recurrent Hybrid Attention for Video Diffusion Transformers
H^2A^2: Homogeneity-Aware and Heterogeneity-Aware Feature Perception for Unified Indoor 3D Object Detection
Towards Intrinsic-Aware Monocular 3D Object Detection
SPAN: Spatial-Projection Alignment for Monocular 3D Object Detection
FailureAtlas: Mapping the Failure Landscape of T2I Models via Active Exploration
HDR-VLM: HDR-Domain Adaptation of VLMs and Preference-Aligned Quality Assessment for HDR Video Color Grading
BiomedCCPL: Causal Conditional Prompt Learning for Biomedical Vision-Language Models
DynamicGTR: Leveraging Graph Topology Representation Preferences to Boost VLM Capabilities on Graph QAs
From Intuition to Investigation: A Tool-Augmented Reasoning MLLM Framework for Generalizable Face Anti-Spoofing
SunFaded: Illumination-Aware Gaussian Splatting for Dark Scenes with Camera-Mounted Active Lighting
TokenSplat: Token-aligned 3D Gaussian Splatting for Feed-forward Pose-free Reconstruction
AeroGS: Scale-Aware Gaussian Splatting for Pose-Free Dynamic UAV Scene Reconstruction
AERGS-SLAM: Auto-Exposure-Robust Stereo 3D Gaussian Splatting SLAM
Cross-View Splatter: Feed-Forward View Synthesis with Georeferenced Images
TagSplat: Topology-Aware Gaussian Splatting for Dynamic Mesh Modeling and Tracking
Hierarchical Visual Relocalization with Nearest View Synthesis from Feature Gaussian Splatting
ClipGStream: Clip-Stream Gaussian Splatting for Any Length and Any Motion Multi-View Dynamic Scene Reconstruction
MoRGS: Efficient Per-Gaussian Motion Reasoning for Streamable Dynamic 3D Scenes
ReasonMap: Towards Fine-Grained Visual Reasoning from Transit Maps
DialogueVPR: Towards Conversational Visual Place Recognition
DocSeeker: Structured Visual Reasoning with Evidence Grounding for Long Document Understanding
Recurrent Reasoning with Vision-Language Models for Estimating Long-Horizon Embodied Task Progress
Grounding Everything in Tokens for Multimodal Large Language Models
Evolving Contextual Safety in Multi-Modal Large Language Models via Inference-Time Self-Reflective Memory
Think Visually, Reason Textually: Vision-Language Synergy in Abstract Reasoning
VKG-QA: Visual Knowledge Graph-based Question Answer for Large Multimodal Models
Human-like Abstract Visual Reasoning via Understanding and Solving Reasoning Loop
VITAL: Vision-Encoder-centered Pre-training for LMMs in Visual Quality Assessment
Generative Video Compression with One-Dimensional Latent Representation
Learned Image Compression via Sparse Attention and Adaptive Frequency
VecAttention: Vector-wise Sparse Attention for Accelerating Long Context Inference
Parallax to Align Them All: An OmniParallax Attention Mechanism for Distributed Multi-View Image Compression
VLM-Loc: Localization in Point Cloud Maps via Vision-Language Models
HOLO: Homography-Guided Pose Estimator Network for Fine-Grained Visual Localization on SD Maps
TriLite: Efficient Weakly Supervised Object Localization with Universal Visual Features and Tri-Region Disentanglement
OVOD-Agent: A Markov–Bandit Framework for Proactive Visual Reasoning and Self-Evolving Detection
Tutor-Student Reinforcement Learning: A Dynamic Curriculum for Robust Deepfake Detection
Seeing as Experts Do: A Knowledge-Augmented Agent for Open-Set Fine-Grained Visual Understanding
Uncertainty-Aware Modality Fusion for Unaligned RGB-T Salient Object Detection
More Than Meets the Eye: A Unified Image Fusion Framework via Semantic-Pixel Entropy Trade-off for Zero-Shot Generalization
Beyond Sequential Tools: A Unified VLM Agent System for Photographic Post-Processing via Dynamic Multi-Expert Fusion
Multi-modal Frequency Decomposition Network for Semantic Scene Completion
FusionRegister: Every Infrared and Visible Image Fusion Deserves Registration
Enhancing Unregistered Hyperspectral Image Super-Resolution via Unmixing-based Abundance Fusion Learning
LRHDR: Learning Representation-enhanced HDR Video Reconstruction
Interpretable Cross-Domain Few-Shot Learning with Rectified Target-Domain Local Alignment
AgentDet: A Shared-Blackboard Multi-Agent Framework for Zero-/Few-Shot Object Detection
Noise-Aware Few-Shot Learning through Bi-directional Multi-View Prompt Alignment
Progressive Mask Distillation for Self-supervised Video Representation
DetectSCI: Toward Object-Guided ROI Reconstruction for High-Resolution Video Snapshot Compressive Imaging
Solving a Nonlinear Blind Inverse Problem for Tagged MRI with Physics and Deep Generative Priors
GH-NAF: Grid-Adaptive Hash-Level–Attended Neural Attenuation Fields for Discrepancy-Aware CBCT
Computer Vision with a Superpixelation Camera
Multi-Scale Gradient-Guided Unrolling Architecture with Adaptive Mamba for Compressive Sensing
Bulk RNA-seq Guided Multi-modal Detection of Anomalous Regions in Human Cancer via Spatial Transcriptomics
ParaUni: Enhance Generation in Unified Multimodal Model with Reinforcement-driven Hierarchical Parallel Information Interaction
PhyCo: Learning Controllable Physical Priors for Generative Motion
Unified Multimodal Models as Auto-Encoders
Expand and Prune: Maximizing Trajectory Diversity for Effective GRPO in Generative Models
Neural Differentiation in Deep Networks: A Theoretical Framework for Expressivity and Representational Diversity
DuetMerging: Synergizing Dynamic and Static Strategies for Mitigating Task Interference in Model Merging
Generative Modeling of Weights: Generalization or Memorization?
Vision-Oriented Lightweight Neural Architecture Search with Budget-Adaptive Evaluation
Improving Sparse Autoencoder with Dynamic Attention
Hyperbolic Busemann Neural Networks
FlowDIS: Language-Guided Dichotomous Image Segmentation with Flow Matching
NG-GS: NeRF-guided 3D Gaussian Splatting Segmentation
MatchED: Crisp Edge Detection Using End-to-End, Matching-based Supervision
Seeing Beyond: Extrapolative Domain Adaptive Panoramic Segmentation
MatchMask: Mask-Centric Generative Data Augmentation for Label-Scarce Semantic Segmentation
FUSAR-GPT: A Spatiotemporal Feature-Embedded and Two-Stage Decoupled Visual Language Model for SAR Imagery
UniChange: Unifying Change Detection with Multimodal Large Language Model
See What We Cannot See: A Geo-guided Reasoning Benchmark for Object Counting under Adverse Earth Observation Conditions
MM-OVSeg: Multimodal Optical–SAR Fusion for Open-Vocabulary Segmentation in Remote Sensing
Fourier Angle Alignment for Oriented Object Detection in Remote Sensing
Good Can Sometimes be Bad: A Unified Attack against 3D Point Cloud Classifier by a Flexible Isotropic Resampling
FeatureFool: Zero-Query Fooling of Video Models via Feature Map
AdvFM: Lookahead Flow-Matching Velocity-Field Attacks for Imperceptible and Transferable Adversarial Examples
The Power of Decaying Steps: Enhancing Attack Stability and Transferability for Sign-based Optimizers
Your Classifier Can Do More: Towards Balancing the Gaps in Classification, Robustness, and Generation
Hierarchical Attacks for Multi‑Modal Multi‑Agent Reasoning
CoMo: Learning Continuous Latent Motion from Internet Videos for Scalable Robot Learning
Δynamics: Language-Based Representation for Inferring Rigid-Body Dynamics From Videos
Diagnose, Correct, and Learn from Manipulation Failures via Visual Symbols
GeCo-SRT: Geometry-aware Continual Adaptation for Cross-Task Sim-to-Real Transfer
BiPreManip: Learning Affordance-Based Bimanual Pre-Manipulation through Anticipatory Collaboration
ActiveGrasp: Information-Guided Active Grasping with Calibrated Energy-based Model
Learning Surgical Robotic Manipulation with 3D Spatial Priors
SimRecon: SimReady Compositional Scene Reconstruction from Real Videos
RaUF: Learning the Spatial Uncertainty Field of Radar
Instance-level Visual Active Tracking with Occlusion-Aware Planning
Mantis: A Versatile Vision-Language-Action Model with Disentangled Visual Foresight
Toward Low-Cost yet Effective Temporal Learning for UAV Tracking
Occlusion-Aware SORT: Observing Occlusion for Robust Multi-Object Tracking
Beyond Explicit Language: Plug-and-Play Visual-to-Linguistic Modeling Toward General Object Tracking
From Panel to Pixel: Zoom-In Vision–Language Pretraining from Biomedical Scientific Literature
LEMON: A Large Endoscopic MONocular Dataset and Foundation Model for Perception in Surgical Settings
Diffusion with a Linguistic Compass: Steering the Generation of Clinically Plausible Future sMRI Representations for Early MCI Conversion Prediction
Personalized Longitudinal Medical Report Generation via Temporally-Aware Federated Adaptation
Duala: Dual-Level Alignment of Subjects and Stimuli for Cross-Subject fMRI Decoding
OmniBrainBench: A Comprehensive Multimodal Benchmark for Brain Imaging Analysis Across Multi-stage Clinical Tasks
Beyond Pixel Simulation: Pathology Image Generation via Diagnostic Semantic Tokens and Prototype Control
FISHuman: Fine-grained Single-image 3D Human Reconstruction via Multi-view 4D Remeshing
RAM: Recover Any 3D Human Motion in-the-Wild
From 2D Alignment to 3D Plausibility: Unifying Heterogeneous 2D Priors and Penetration-Free Diffusion for Occlusion-Robust Two-Hand Reconstruction
M4Human: A Large-Scale Multimodal mmWave Radar Benchmark for Human Mesh Reconstruction
ReGenHOI: Unifying Reconstruction and Generation for 3D Human–Object Interaction Understanding
Through the Frequency Lens: Cross-Domain Generalisable Gaze Estimation with Adaptive Modulation
SHands: A Multi-View Dataset and Benchmark for Surgical Hand-Gesture and Error Recognition Toward Medical Training
Beyond Static Frames: Temporal Aggregate-and-Restore Vision Transformer for Human Pose Estimation
Learning Forgery-Aware Lip Representations Without Forgery Priors
Beyond [CLS] Token: Query-Driven Token-Level Forgery Purification for Generalizable Deepfake Detection
TokenTrace: Multi-Concept Attribution through Watermarked Token Recovery
Unleashing Vision-Language Semantics for Deepfake Video Detection
RDFace: A Benchmark Dataset for Rare Disease Facial Image Analysis under Extreme Data Scarcity and Phenotype-Aware Synthetic Generation
Zero-shot Detection of AI-Generated Image via RAW-RGB Alignment
Investigating Self-Supervised Representations for Audio-Visual Deepfake Detection
FastRef: Fast Prototype Refinement for Few-shot Industrial Anomaly Detection
RC-NF: Robot-Conditioned Normalizing Flow for Real-Time Anomaly Detection in Robotic Manipulation
MMR-AD: A Large-Scale Multimodal Dataset for Benchmarking General Anomaly Detection with Multimodal Large Language Models
Wavelet-Driven 3D Anomaly Detection under Pose-Agnostic and Sparse-View
GPFlow: Gaussian Prototype Probability Flow for Unsupervised Multi-Modal Anomaly Detection
HP-Edit: A Human-Preference Post-Training Framework for Image Editing
RebRL: Reinforcing Discrete Visual Diffusion Models with Rebalanced Timestep Credits
Ego-InBetween: Generating Object State Transitions in Ego-Centric Videos
Towards Fine-Grained Attribution: Instance-Aware Preference Optimization for Aligning Diffusion Models
SketchRevive: Fine-Grained Pixel-to-Vector Sketch Completion with Diffusion-Prior-Guided Multimodal LLMs
You Only Erase Once: Erasing Anything without Bringing Unexpected Content
NS-Diff: Fluid Navier–Stokes Guided Video Diffusion via Reinforcement Learning
Smoothing the Score Function to Enhance Generalization in Diffusion Models
PropFly: Learning to Propagate via On-the-Fly Supervision from Pre-trained Video Diffusion Models
AdaCluster: Adaptive Query-Key Clustering for Sparse Attention in Video Generation
Interpretable Prompts made Edit-Friendly: Token-to-Token Similarity Reduction in dLLMs for Edit-Friendly Hard Prompt Inversion
Vision Foundation Models Can Be Good Tokenizers for Latent Diffusion Models
Adaptive Spectral Feature Forecasting for Diffusion Sampling Acceleration
EasyOmnimatte: Taming Pretrained Inpainting Diffusion Models for End-to-End Video Layered Decomposition
Hierarchical Codec Diffusion for Video-to-Speech Generation
Semantic Alignment for Pose-Invariant Identity Preserving Diffusion
Causality in Video Diffusers is Separable from Denoising
2ndMatch: Finetuning Pruned Diffusion Models via Second-Order Jacobian Matching
MacTok: Robust Continuous Tokenization for Image Generation
Adaptive Video Distillation: Mitigating Oversaturation and Temporal Collapse in Few-Step Generation
Beyond the Golden Data: Resolving the Motion-Vision Quality Dilemma via Timestep Selective Training
Elucidating the SNR-t Bias of Diffusion Probabilistic Models
RFDM: Residual Flow Diffusion Models for Video Editing
FreqEdit: Preserving High-Frequency Features for Robust Multi-Turn Image Editing
Graph-Guided Online Concept Erasure for Text-to-Image Diffusion Models
Edit2Perceive: Image Editing Diffusion Models Are Strong Dense Perceivers
CTCal: Rethinking Text-to-Image Diffusion Models via Cross-Timestep Self-Calibration
D2Cache: Second-Order Delta Caching for Higher Video Diffusion Acceleration
Text-Driven 3D Hand Motion Generation from Sign Language Data
Guiding Diffusion Models with Semantically Degraded Conditions
Semantics Lead the Way: Harmonizing Semantic and Texture Modeling with Asynchronous Latent Diffusion
Improving Diffusion Generalization with Weak-to-Strong Segmented Guidance
Adaptive Auxiliary Prompt Blending for Target-Faithful Diffusion Generation
SegQuant: A Semantics-Aware and Generalizable Quantization Framework for Diffusion Models
MusicInfuser: Making Video Diffusion Listen and Dance
TriDF: Evaluating Perception, Detection, and Hallucination for Interpretable DeepFake Detection
ArtLLM: Generating Articulated Assets via 3D LLM
A Closer Look at Cross-Domain Few-Shot Object Detection: Fine-Tuning Matters and Parallel Decoder Helps
Refracting Reality: Generating Images with Realistic Transparent Objects
PinPoint: Evaluation of Composed Image Retrieval with Explicit Negatives, Multi-Image Queries, and Paraphrase Testing
NitroGen: An Open Foundation Model for Generalist Gaming Agents
Differentiable Laplacian Matrix Guided Superpixel Segmentation
UniSH: Unifying Scene and Human Reconstruction in a Feed-Forward Pass
Image Generation from Contextually-Contradictory Prompts
LeapAlign: Post-training Flow Matching Models at Any Generation Step by Building Two-Step Trajectories
StructXLIP: Enhancing Vision-language Models with Multimodal Structural Cues
FrankenMotion: Part-level Human Motion Generation and Composition
SeeU: Seeing the Unseen World via 4D Dynamics-aware Generation
SurgCoT: Advancing Spatiotemporal Reasoning in Surgical Videos through a Chain-of-Thought Benchmark
X-PCR: A Benchmark for Cross-modality Progressive Clinical Reasoning in Ophthalmic Diagnosis
Beyond Single Images: A Comprehensive Benchmark for Album-Level Vision-Language Understanding
Unique Lives, Shared World: Learning from Single-Life Videos
Progress by Pieces: Test-Time Scaling for Autoregressive Image Generation
Erasing Thousands of Concepts: Towards Scalable and Practical Concept Erasure for Text-to-Image Diffusion Models
Better than Average: Spatially-Aware Aggregation of Segmentation Uncertainty Improves Downstream Performance
Geo2: Geometry-Guided Cross-view Geo-Localization and Image Synthesis
Unified Latent Space for Understanding and Generation via Semantic Auto-encoder
BiMotion: B-spline Motion for Text-guided Dynamic 3D Character Generation
VGGT-Ω
TokenLight: Precise Lighting Control in Images using Attribute Tokens
Attention, May I Have Your Decision? Localizing Generative Choices in Diffusion Models
GazeOnce360: Fisheye-Based 360° Multi-Person Gaze Estimation with Global–Local Feature Fusion
Global Underwater Geolocation from Time-Lapse Polarization Imagery
Building Robust Vision Encoders for Cross-Dataset Evaluation in Immunofluorescent Microscopy
Lighting in Motion: Spatiotemporal HDR Lighting Estimation
Visual Grounding for Object Questions
Revisiting Monocular SLAM with Spatio-Temporal Scene Modeling
Perception Characteristics Distance: Measuring Stability and Robustness of Perception System in Dynamic Conditions under a Certain Decision Rule
Relightable Holoported Characters: Capturing and Relighting Dynamic Human Performance from Sparse Views
Camera Control for Text-to-Image Generation via Learning Viewpoint Tokens
Material Magic Wand: Material-Aware Grouping of 3D Parts in Untextured Meshes
GraPHFormer: A Multimodal Graph Persistent Homology Transformer for the Analysis of Neuroscience Morphologies
UniVerse: A Unified Modulation Framework for Segmentation-Free, Disentangled Multi-Concept Personalization
PR-MaGIC: Prompt Refinement Via Mask Decoder Gradient Flow For In-Context Segmentation
EasyV2V: A High-quality Instruction-based Video Editing Framework
Resolving the Identity Crisis in Text-to-Image Generation
Evidential Neural Radiance Fields
CaptionQA: Is Your Caption as Useful as the Image Itself?
Is the Modality Gap a Bug or a Feature? A Robustness Perspective
Hear What You See: Video-to-Audio Generation with Diffusion Transformer and Semantic-Temporal Alignment-Ranked Direct Preference Optimization
Adaptive Data Augmentation with Multi-armed Bandit: Sample-Efficient Embedding Calibration for Implicit Pattern Recognition
Video-Only ToM: Enhancing Theory of Mind in Multimodal Large Language Models
MFEN: Multi-Frequency Expert Network for Visible-Infrared Person Re-ID
Unleashing Stealthy Backdoor Pandemic by Infecting a Single Diffusion Model
CHEEM: Continual Learning by Reuse, New, Adapt and Skip - A Hierarchical Exploration-Exploitation Approach
Which Concepts to Forget and How to Refuse? Decomposing Concepts for Continual Unlearning in Large Vision-Language Models
Joint Learning of General and Diverse Patterns with Mixture of Memory Experts for Weakly-Supervised Video Anomaly Detection
SIF: Semantically In-Distribution Fingerprints for Large Vision-Language Models
SASNet: Spatially-Adaptive Sinusoidal Networks for INRs
Stable Mean Flow: Lyapunov-Inspired One-Step Flow Matching
Structure-Aware Representation Distillation for Tiny-Dense Object Segmentation
Ghosts in the Point Clouds: De-glaring LiDAR in the Transient Domain
From Static to Dynamic: Exploring Self-supervised Image-to-Video Representation Transfer Learning
Circular-DPO: Aligning Multi-Stage 3D Generative Models via Preference Feedback Loop
SODA: Sensitivity-Oriented Dynamic Acceleration for Diffusion Transformer
MetricHMSR: Metric Human Mesh and Scene Recovery from Monocular Images
LAMP: Language-Assisted Motion Planning for Controllable Video Generation
Phrase-grounded APO for Improving Chest X-ray Report Generation
Vibe Spaces for Creatively Connecting and Expressing Visual Concepts
Learning 3D Reconstruction with Priors in Test Time
DSO: Direct Steering Optimization for Bias Mitigation
LumiMotion: Improving Gaussian Relighting with Scene Dynamics
CCF: Complementary Collaborative Fusion for Domain Generalized Multi-Modal 3D Object Detection
Reasoning-Driven Anomaly Detection and Localization with Image-Level Supervision
ProPhy: Progressive Physical Alignment for Dynamic World Simulation
UPLiFT: Efficient Pixel-Dense Feature Upsampling with Local Attenders
Consistency Beyond Contrast: Enhancing Open-Vocabulary Object Detection Robustness via Contextual Consistency Learning
Towards Robust Vision Transformers: Path Dependency Analysis and a Simple Two-Stage Adversarial Training
AdaDexTrack: Dynamic Modulation for Adaptive and Generalizable Dexterous Manipulation Tracking
TraceGen: World Modeling in 3D Trace Space Enables Learning from Cross-Embodiment Videos
QuadSync: Quadrifocal Tensor Synchronization via Tucker Decomposition
CLEX: Complementary Label Exchange Learning for Noisy Facial Expression Recognition
SafeGRPO: Self-Rewarded Multimodal Safety Alignment via Rule-Governed Policy Optimization
Any2Any 3D Diffusion Models with Knowledge Transfer: A Radiotherapy Planning Study
Sparse Spectral LoRA: Routed Experts for Medical VLMs
Beyond Rule-Based Agents: Active Markov Games for Realistic Multi-Agent Interaction in Autonomous Driving
LLM-Guided Probabilistic Fusion for Label-Efficient Document Layout Analysis
Local Precise Refinement: A Dual-Gated Mixture-of-Experts for Enhancing Foundation Model Generalization against Spectral Shifts
Scaling-Aware Data Selection for End-to-End Autonomous Driving Systems
Ghost-FWL: A Large-Scale Full-Waveform LiDAR Dataset for Ghost Detection and Removal
More Natural, More Real: Object-aware Gaussian Splatting for 3D Visual Decoding from Human Brain
Multi-Patch Global-to-Local Transformer Architecture For Efficient Flow Matching and Diffusion Model
Breaking the Scalability Limit of Multi-Projector Calibration with Embedded Cameras
GDRO: Group-level Reward Post-training Suitable for Diffusion Models
MultiModalPFN: Extending Prior-Data Fitted Networks for Multimodal Tabular Learning
Learnability-Driven Submodular Optimization for Active Roadside 3D Detection
When to Think and When to Look: Uncertainty-Guided Lookback
BabyVLM-V2: Toward Developmentally Grounded Pretraining and Benchmarking of Vision Foundation Models
LoPrune: Efficient Data Pruning for LoRA-Based Fine-Tuning of Vision Transformer
DextER: Language-driven Dexterous Grasp Generation with Embodied Reasoning
Diverse Video Generation with Determinantal Point Process-Guided Policy Optimization
Velox: Learning Representations of 4D Geometry and Appearance
PAI-Bench: A Comprehensive Benchmark For Physical AI
ResAD: Normalized Residual Trajectory Modeling for End-to-End Autonomous Driving
Reward Sharpness-Aware Fine-Tuning for Diffusion Models
DTG-Restore: Training-Free Diffusion Refinement for Generative Video Super-Resolution
Factorized Context Aggregation for Robust Cancer Risk Estimation via Soft Re-Ranked Retrieval and Hierarchical Anchors
Infinity-RoPE: Action-Controllable Infinite Video Generation Emerges From Autoregressive Self-Rollout
InterPhys: Physics-aware Human Motion Synthesis in a Dynamic Scene
DeDelayed: Deleting Remote Inference Delay via On-Device Correction
RAYNOVA: Scale-Temporal Autoregressive World Modeling in Ray Space
Conditional Factuality Controlled LLMs with Generalization Certificates via Conformal Sampling
A Multi-Agent Perception-Action Alliance for Efficient Long Video Reasoning
BluRef: Unsupervised Image Deblurring with Dense-Matching References
MooCap: A Multi-View Benchmark for Cow-Object-Human Interaction and Behavior Dynamics
Point Cloud as a Foreign Language for Multi-modal Large Language Model
Fresco: Frequency–Spatial Consistent Optimization for Fine-Grained Head Avatar Modeling
Contact-Aware Neural Dynamics
VisPlay: Self-Evolving Vision-Language Models
EventGait: Towards Robust Gait Recognition with Event Streams
Interpretable Motion-Attentive Maps: Spatio-Temporally Localizing Concepts in Video Diffusion Transformers
ShapeAR: Generating Editable Shape Layers via Autoregressive Diffusion
Looking Beyond the Window: Global-Local Aligned CLIP for Training-free Open-Vocabulary Semantic Segmentation
DUET-VLM: Dual stage Unified Efficient Token reduction for VLM Training and Inference
Seeing Clearly, Reasoning Confidently: Plug-and-Play Remedies for Vision Language Model Blindness
UniDAC: Universal Metric Depth Estimation for Any Camera
SAT-RRG: LLM-Guided Self-Adaptive Training for Radiology Report Generation with Token-Level Push–Pull Optimization
2D-LFM: Lifting Foundation Model without 3D Supervision
Aligning Text, Images and 3D Structure Token-by-Token
Learning Straight Flows: Variational Flow Matching for Efficient Generation
Captain Safari: A World Engine with Pose-Aligned 3D Memory
MOMO: Mars Orbital MOdel Foundation Model for Mars Orbital Applications
Forecasting 3D Scanpaths in Egocentric Video
BridgeEQA: Virtual Embodied Agents for Real Bridge Inspections
Do Vision-Language Models Leak What They Learn? Adaptive Token-Weighted Model Inversion Attacks
Simple Agents Outperform Experts in Biomedical Imaging Workflow Optimization
Fast-FoundationStereo: Real-Time Zero-Shot Stereo Matching
TIACam: Text-Anchored Invariant Feature Learning with Auto-Augmentation for Camera-Robust Zero-Watermarking
Retrieving Counterfactuals Improves Visual In-Context Learning
Measuring the (Un)Faithfulness of Concept-Based Explanations
A More Word-like Image Tokenization for MLLMs
Beyond Top Activations: Efficient and Reliable Crowdsourced Evaluation of Automated Interpretability
Masked Representation Modeling for Domain-Adaptive Segmentation
RARE: Learn to RAnk and REtrieve for Monocular 3D Object Detection
Cluster-Wise Spatio-Temporal Masking for Efficient Video-Language Pretraining
Yo'City: Personalized and Boundless 3D Realistic City Scene Generation via Self-Critic Expansion
FedCART: Tackling Long-Tailed Distributions in Federated Adversarial Training via Classifier Refinement
LaRP: Efficient Multi-View Inpainting with Latent Reprojection Priors
Selective Amnesia using Contrastive Subnet Erasure for Class Level Unlearning in Vision Models
Designing Instance-Level Sampling Schedules via REINFORCE with James-Stein Shrinkage
Gaussian Splatting-based Low-Rank Tensor Representation for Multi-Dimensional Image Recovery
History to Future: Evolving Agent with Experience and Thought for Zero-shot Vision-and-Language Navigation
CIGMA: Causal Information-Gain Mechanistic Attribution of Attention Heads in Vision Transformers
Minimal Constraint Relaxation for Multiview Autocalibration
The devil is in the details: Enhancing Video Virtual Try-On via Keyframe-Driven Details Injection
FPSBench: A Benchmark for Video Understanding at High Frame Rates
Back to Source: Open-Set Continual Test-Time Adaptation via Domain Compensation
TaskForce: Cooperative Multi-agent Reinforcement Learning for Multi-task Optimization
Energy-GS: Image Energy-guided Pose Alignment Gaussian Splatting with redesigned pose gradient flow
From Selection to Scheduling: Federated Geometry-Aware Correction Makes Exemplar Replay Work Better under Continual Dynamic Heterogeneity
How Much 3D Do Video Foundation Models Encode?
Conversational Image Segmentation: Grounding Abstract Concepts with Scalable Supervision
STRNet: Visual Navigation with Spatio-Temporal Representation through Dynamic Graph Aggregation
ReCALL: Recalibrating Capability Degradation for MLLM-based Composed Image Retrieval
QUANTIPHY: A Quantitative Benchmark Evaluating Physical Reasoning Abilities of Vision-Language Models
Mistake Attribution: Fine-Grained Mistake Understanding in Egocentric Videos
WPT: World-to-Policy Transfer via Online World Model Distillation
Online3R: Online Learning for Consistent Sequential Reconstruction Based on Geometry Foundation Model
Back to the Feature: Explaining Video Classifiers with Video Counterfactual Explanations
PMRNet: Physics-informed Multi-scale Refinement Network for Medical Image Segmentation
MERG3R: A Divide-and-Conquer Approach to Large-Scale Neural Visual Geometry
Ref4D-VideoBench: Four-Dimensional Reference-Based Evaluation of Text-to-Video Generative Models
EchoFoley: Event-Centric Hierarchical Control for Video Grounded Creative Sound Generation
Prime Once, then Reprogram Locally: An Efficient Alternative to Black-Box Service Model Adaptation
An Optimal Transport-driven Approach for Cultivating Latent Space in Online Incremental Learning
Artiverse: A Diverse and Physically Grounded Dataset for Articulated Objects
3DReflecNet: A Large-Scale Dataset for 3D Reconstruction of Reflective, Transparent, and Low-Texture Objects
Reinforcing Structured Chain-of-Thought for Video Understanding
Proxy3D: Efficient 3D Representations for Vision-Language Models via Semantic Clustering and Alignment
Co-Me: Confidence Guided Token Merging for Visual Geometric Transformers
PortraitDirector: A Hierarchical Disentanglement Framework for Controllable and Real-time Facial Reenactment
FairLLaVA: Fairness-Aware Parameter-Efficient Fine-Tuning for Large Vision-Language Assistants
CineSRD: Leveraging Visual, Acoustic, and Linguistic Cues for Open-World Visual Media Speaker Diarization
Bias Is a Subspace, Not a Coordinate: A Geometric Rethinking of Post-hoc Debiasing in Vision-Language Models
RunawayEvil: Jailbreaking the Image-to-Video Generative Models
REACH: Explicit Recovery Behavior for Diffusion Policies
Dual Ascent Diffusion for Inverse Problems
MotionEdit: Benchmarking and Learning Motion-Centric Image Editing
SEASON: Mitigating Temporal Hallucination in Video Large Language Models via Self-Diagnostic Contrastive Decoding
RankOOD - Class Ranking-based Out-of-Distribution Detection
Follow the Saliency: Supervised Saliency for Retrieval-augmented Dense Video Captioning
How Far Can We Go With Synthetic Data for Audio-Visual Sound Source Localization?
R4Det: 4D Radar-Camera Fusion for High-Performance 3D Object Detection
KnowVal: A Knowledge-Augmented and Value-Guided Autonomous Driving System
Towards Knowledge-augmented Bayesian Deep Learning For Computer Vision
Vanast: Virtual Try-On with Human Image Animation via Synthetic Triplet Supervision
Event-Illumination Collaborative Low-light Image Enhancement with a High-resolution Real-world Dataset
OptiMVMap: Offline Vectorized Map Construction via Optimal Multi-vehicle Perspectives
Drive My Way: Preference Alignment of Vision-Language-Action Model for Personalized Driving
ForceVLA2: Unleashing Hybrid Force-Position Control with Force Awareness for Contact-Rich Manipulation
NOWA: Null-space Optical Watermark for Invisible Capture Fingerprinting and Tamper Localization
PECCVAI: Overcoming the Brittleness of AI Image Watermarking Under Visual Paraphrasing Attacks
ParallelVLM: Lossless Video-LLM Acceleration with Visual Alignment Aware Parallel Speculative Decoding
Every Error has Its Magnitude: Asymmetric Mistake Severity Training for Multiclass Multiple Instance Learning
Concept-Aware Batch Sampling Improves Language-Image Pretraining
DiffSoup: Direct Differentiable Rasterization of Triangle Soup for Extreme Radiance Field Simplification
CapNav: Benchmarking Vision Language Models on Capability-conditioned Indoor Navigation
Learnability-Guided Diffusion for Dataset Distillation
D3D-VLP: Dynamic 3D Vision-Language-Planning Model for Embodied Grounding and Navigation
Foundation Model Priors Enhance Object Focus in Feature Space for Source-Free Object Detection
Harnessing the Power of Foundation Models for Accurate Material Classification
Feed-Forward One-Shot Animatable Textured Mesh Avatar Reconstruction
Same or Not? Enhancing Visual Perception in Vision-Language Models
VisRes Bench: On Evaluating the Visual Reasoning Capabilities of VLMs
PTC-Depth: Pose-Refined Monocular Depth Estimation with Temporal Consistency
SAMIX: Reinforcing SAM2 with Semantic Adapter and Reference Selecting Policy for Mix-Supervised Segmentation
Reinforcing Video Reasoning Segmentation to Think Before It Segments
FlashDecoder: Real-Time Latent-to-Pixel Streaming Decoder with Transformers
Render-to-Adapt: Unsupervised Personal Adaptation for Gaze Estimation
Differentially Private 2D Human Pose Estimation
Cleaning the Pool: Progressive Filtering of Unlabeled Pools in Deep Active Learning
Masked Auto-Regressive Variational Acceleration: Fast Inference Makes Practical Reinforcement Learning
BALM: A Model-Agnostic Framework for Balanced Multimodal Learning under Imbalanced Missing Rates
Scalable Trajectory Generation for Whole-Body Mobile Manipulation
NimbusGS: Unified 3D Scene Reconstruction under Hybrid Weather
Efficient Training for Human Video Generation with Entropy-Guided Prioritized Progressive Learning
Superman: Unifying Skeleton and Vision for Human Motion Perception and Generation
SparVAR: Exploring Sparsity in Visual AutoRegressive Modeling for Training-Free Acceleration
AMap: Distilling Future Priors for Ahead-Aware Online HD Map Construction
Verifying Neural Network Robustness with Dual Perturbations
Generalizable Video Quality Assessment via Weak-to-Strong Learning
QuCNet: Quantum Deep Learning Driven Multi-Circuit Network for Remote Sensing Image Classification
Ego: Embedding-Guided Personalization of Vision-Language Models
Learning to Infer Parameterized Representations of Plants from 3D Scans
F^2HDR: Two-Stage HDR Video Reconstruction via Flow Adapter and Physical Motion Modeling
Cross-Instance Gaussian Splatting Registration via Geometry-Aware Feature-Guided Alignment
Conan: Progressive Learning to Reason Like a Detective over Multi-Scale Visual Evidence
Uni-Hema: Unified Model for Digital Hematopathology
MatSpray: Fusing 2D Material World Knowledge on 3D Geometry
MSRL: Scaling Generative Multimodal Reward Modeling via Multi-Stage Reinforcement Learning
Distribution-Aligned Multimodal Fusion for Robust Object Detection
Obstruction Reasoning for Robotic Grasping
Revisiting Pose Sensitivity in Splat-based Computed Tomography under Sparse-view Reconstruction
The Golden Subspace: Where Efficiency Meets Generalization in Continual Test-Time Adaptation
Splat-Based Metal Artifact Reduction in Cone-Beam CT via Compact Attenuation Modeling
VISTA: A Test-Time Self-Improving Video Generation Agent
Anomaly as Non-Conformity via Training-Free Graph Laplacian Energy Minimization
Representing 3D Faces with Learnable B-Spline Volumes
Learning to Drive is a Free Gift: Large-Scale Label-Free Autonomy Pretraining from Unposed In-The-Wild Videos
EReCu: Pseudo-label Evolution Fusion and Refinement with Multi-Cue Learning for Unsupervised Camouflage Detection
Easy2Hard: From Partially to Fully Unmatched Modalities as Negative Samples in Contrastive Learning
OrthoFuse: Training-free Riemannian Fusion of Orthogonal Style-Concept Adapters for Diffusion Models
Language Models Can Explain Visual Features via Steering
Weakly Supervised Video Anomaly Detection with Anomaly-Connected Components and Intention Reasoning
AvatarPointillist: AutoRegressive 4D Gaussian Avatarization
Thinking Diffusion: Penalize and Guide Visual-Grounded Reasoning in Diffusion Multimodal Language Models
GraphVLM: Benchmarking Vision Language Models for Multimodal Graph Learning
Reconstructing CLIP for Open-Vocabulary Dense Perception
Streaming Video Crime Anticipation with Spatio-Temporal Causal Reasoning
PFGNet: A Fully Convolutional Frequency-Guided Peripheral Gating Network for Efficient Spatiotemporal Predictive Learning
ProgressiveAvatars: Progressive Animatable 3D Gaussian Avatars
SyncDreamer: Controllable and Expressive Avatar Generation Beyond the Talking Head
Neural-Centric Video Processing Pipeline for Unified Multi-Task Inference
PhysVid: Physics Aware Local Conditioning for Generative Video Models
Test-time Sparsity for Extreme Fast Action Diffusion
BEA-GS: BEyond RAdiance Supervision in 3DGS for Precise Object Extraction
SeeThrough3D: Occlusion Aware 3D Control in Text-to-Image Generation
Beyond Soft Label: Dataset Distillation via Orthogonal Gradient Matching
WikiCLIP: An Efficient Contrastive Baseline for Open-domain Visual Entity Recognition
DeepAlign: Mitigating Modality Conflict through Modality-Specific Alignment
MUFASA: A Multi-Layer Framework for Slot Attention
Prompt Yourself: Awakening Textual Semantics in 1D Visual Tokenizers
UniSpector: Towards Universal Open-set Defect Recognition via Spectral-Contrastive Visual Prompting
VGA-Bench: A Unified Benchmark and Multi-Model Framework for Video Aesthetics and Generation Quality Evaluation
Multi-Modal Representation Learning via Semi-Supervised Rate Reduction for Generalized Category Discovery
GeneVAR: Causal MeanFlow for Autoregressive Gene-to-WSI Tile Synthesis
AdaPrior: Bayesian-Inspired Adaptive Prior Correction for Long-Tailed Continual Learning
Composing Concepts from Images and Videos via Concept-prompt Binding
Computation and Communication Efficient Federated Unlearning via On-server Gradient Conflict Mitigation and Expression
Video Panels for Long Video Understanding
NeighborMAE: Exploiting Spatial Dependencies between Neighboring Earth Observation Images in Masked Autoencoders Pretraining
TAG-MoE: Task-Aware Gating for Unified Generative Mixture-of-Experts
PaNDaS: Learnable Shape Interpolation Modeling with Localized Control
RLFTSim: Realistic and Controllable Multi-Agent Traffic Simulation via Reinforcement Learning Fine-Tuning
AIMDepth: Asymmetric Image-Event Mamba for Monocular Depth Estimation
Recovering Physically Plausible Human-Object Interactions from Monocular Videos
SCE-SLAM: Scale-Consistent Monocular SLAM via Scene Coordinate Embeddings
What Is It Like to Be a Noise? An Entropy-based Gaussian Noise Regularization for Diffusion Models
PatchAlign3D: Local Feature Alignment for Dense 3D Shape Understanding
Routing on Demand: DSNet for Efficient Progressive Point Cloud Denoising
RAVEN: Radar Adaptive Vision Encoders for Efficient Chirp-wise Object Detection and Segmentation
Frame2Freq: Spectral Adapters for Fine-Grained Video Understanding
Group Editing: Edit Multiple Images in One Go
UniEdit-I: Training-free Image Editing for Unified VLM via Iterative Understanding, Editing and Verifying
VOSR: A Vision-Only Generative Model for Image Super-Resolution
Flash-DMD: Towards High-Fidelity Few-Step Image Generation with Efficient Distillation and Joint Reinforcement Learning
pH-Strips for Selective Forgetting: A Blunt but Fast Diagnostic Baseline for Machine Unlearning
PlannerRFT: Reinforcing Diffusion Planners through Closed-Loop and Sample-Efficient Fine-Tuning
R3-PCQA: Ray-Reprojection-Reinforcement for No-Reference 3D Point Cloud Quality Assessment
CoLoR: The Devil is in Scene Coordinate Regression for Large-Scale Visual Localization
MIBURI: Towards Expressive Interactive Gesture Synthesis
FINER: MLLMs Hallucinate under Fine-grained Negative Queries
SAIL: Similarity-Aware Guidance and Inter-Caption Augmentation-based Learning for Weakly-Supervised Dense Video Captioning
EgoFlow: Gradient-Guided Flow Matching for Egocentric 6DoF Object Motion Generation
Explaining CLIP Zero-shot Predictions Through Concepts
Rethinking Model Selection in VLM Through the Lens of Gromov-Wasserstein Distance
InverFill: One-Step Inversion for Enhanced Few-Step Diffusion Inpainting
PanDA: Unsupervised Domain Adaptation for Multimodal 3D Panoptic Segmentation in Autonomous Driving
LAM: Language Articulated Object Modelers
The Road Less Seen: Segment Exploration for Weakly Supervised Video Anomaly Detection
Representation-Steered Incremental Adapter-Tuning for Class-Incremental Learning with Pre-Trained Models
Rel-Zero: Harnessing Patch-Pair Invariance for Robust Zero-Watermarking Against AI Editing
ProgTrack: A Multi-Object Tracking Algorithm with Progressive Matching Strategy
PolySLGen: Online Multimodal Speaking-Listening Reaction Generation in Polyadic Interaction
Dejavu: Towards Experience Feedback Learning for Embodied Intelligence
Generative Neural Video Compression via Video Diffusion Prior
FloodDiffusion: Tailored Diffusion Forcing for Streaming Motion Generation
Coverage Optimization for Camera View Selection
DepthFocus: Controllable Depth Estimation for See-Through Scenes
Sampling-Aware Quantization for Diffusion Models
Scene-Centric Unsupervised Video Panoptic Segmentation
Granulon: Awakening Pixel-Level Visual Encoders with Adaptive Multi-Granularity Semantics for MLLM
Condensed Test-Time Adaptation of VLMs for Action Recognition
VideoCoF: Unified Video Editing with Temporal Reasoner
Optical Diffraction-based Convolution for Semiconductor Lithography
Counterfactual VLA: Self-Reflective Vision-Language-Action Model with Adaptive Reasoning
OSMO: Open-vocabulary Self-eMOtion Tracking
GenColorBench: A Color Evaluation Benchmark for Text-to-Image Generation
OmniZip: Audio-Guided Dynamic Token Compression for Fast Omnimodal Large Language Models
Frequency Switching Mechanism for Parameter-Efficient Multi-Task Learning
Alert-CLIP: Abnormality-aware Latent-Enhanced Representation Tuning of CLIP for Video Anomaly Detection
ProM3E: Probabilistic Masked MultiModal Embedding Model for Ecology
Open the Motion Door: Atomic Motion Decomposition and Recomposition for Open-Vocabulary Motion Generation
InstructMix2Mix: Consistent Sparse-View Editing Through Multi-View Model Personalization
Flowception: Temporally Expansive Flow Matching for Video Generation
Black-Box Domain Adaptation for Object Detection with Retention-Driven Knowledge Compression
CRAFT-LoRA: Content-Style Personalization via Rank-Constrained Adaptation and Training-Free Fusion
SceneTok: A Compressed, Diffusable Token Space for 3D Scenes
Trainable Log-linear Sparse Attention for Efficient Diffusion Transformers
MVGGT: Multimodal Visual Geometry Grounded Transformer for Multiview 3D Referring Expression Segmentation
AURA: Multi-modal Shared Autonomy for Urban Navigation
BAgger: Backwards Aggregation for Mitigating Drift in Autoregressive Video Diffusion Models
Pip-Stereo: Progressive Iterations Pruner for Iterative Optimization based Stereo Matching
DiffuView: Multi-View Diffusion Pretraining for 3D Aware Robotic Manipulation
TopoCL: Topological Contrastive Learning for Medical Imaging
Accelerating Streaming Video Large Language Models via Hierarchical Token Compression
PackUV: Packed Gaussian UV Maps for 4D Volumetric Video
The Missing Point in Vision Transformers for Universal Image Segmentation
Where MLLMs Attend and What They Rely On: Explaining Autoregressive Token Generation
Training-free Motion Factorization for Compositional Video Generation
Enhancing Descriptive Captions with Visual Attributes for Multimodal Perception
Scalable Multi-View Subspace Clustering with Tensorized Anchor Guidance
Next-Scale Prediction: A Self-Supervised Approach for Real-World Image Denoising
SimLBR: Learning to Detect Fake Images by Learning to Detect Real Images
FoSS: Modeling Long-Range Dependencies and Multimodal Uncertainty in Trajectory Prediction via Fourier–State Space Integration
Spectral Mixture-of-Experts for Continual Learning
Learning Hierarchical Hyperbolic Mixture Model for Part-aware 3D Generation
Adaptive Capacity Autoregressive Visual Tracking
A Polynomial Chaos Framework for Causal Discovery in Nonlinear Uncertain Systems
FlashIn: Fast and Accurate Image Inversion for Real-time Image Editing
Forecast the Principal, Stabilize the Residual: Subspace-Aware Feature Caching for Diffusion Transformers
PixelDiT: Pixel Diffusion Transformers for Image Generation
RS-SSM: Refining Forgotten Specifics in State Space Model for Video Semantic Segmentation
HandWorld: Hand-Centric Unified Video Action Generation
BOP-ASK: Object-Interaction Reasoning for Vision-Language Models
MakeAnything: Harnessing Diffusion Transformers for Multi-Domain Procedural Sequence Generation
CogniVerse: Revolutionizing Multi-Modal Retrieval-Augmented Generation with Cognitive Reflection and Geometric Reasoning
Learning to Learn Weight Generation via Local Consistency Diffusion
CrossHOI: Learning Cross-View Representations for Monocular 3D Human-Object Interaction Reconstruction
Bridging Fidelity-Reality with Controllable One-Step Diffusion for Image Super-Resolution
PointWorld: Scaling 3D World Models for In-The-Wild Robotic Manipulation
Planning in 8 Tokens: A Compact Discrete Tokenizer for Latent World Model
Agentic Video Summarization via Self-Reflecting Multimodal Understanding
Your Latent Mask is Wrong: Pixel-Equivalent Latent Compositing for Diffusion Models
DreamingComics: A Story Visualization Pipeline via Subject and Layout Customized Generation using Video Models
Dynamic Important Example Mining for Reinforcement Finetuning
PromptStereo: Zero-Shot Stereo Matching via Structure and Motion Prompts
Taming Preference Mode Collapse via Directional Decoupling Alignment in Diffusion Reinforcement Learning
Spatia: Video Generation with Updatable Spatial Memory
Learning to Select Visual Tools from Experience
Zero-Shot Depth Completion with Vision-Language Model
Batch Loss Score for Dynamic Data Pruning
Uni3R: Unified 3D Reconstruction and Semantic Understanding via Generalizable Gaussian Splatting from Unposed Multi-View Images
HypeVPR: Exploring Hyperbolic Space for Perspective to Equirectangular Visual Place Recognition
Detect Anything via Next Point Prediction
SuP: Sub-cloud Driven Point Cloud Registration
DynamicsBoost: Dynamic Plausible Video Generation via Annotation-Free Continuation Preference Optimization
PV-Ground: Text-Guided Point-Voxel Interaction for 3D Visual Grounding
PoInit-of-View: Poisoning Initialization of Views Transfers Across Multiple 3D Reconstruction Systems
Joint-Aligned Latent Action: Towards Scalable VLA Pretraining in the Wild
4DP-QA: Scalable QA for 4D Perception in Vision Language Models
DDT: Decoupled Diffusion Transformer
3D sans 3D Scans: Scalable Pre-training from Video-Generated Point Clouds
GaussFusion: Improving 3D Reconstruction in the Wild with A Geometry-Informed Video Generator
Points-to-3D: Structure-Aware 3D Generation with Point Cloud Priors
SeD-UD: An Influence-Driven and Hierarchically-Decoupled Information Bottleneck for Multimodal Intent Recognition
EmoTaG: Emotion-Aware Talking Head Synthesis on Gaussian Splatting with Few-Shot Personalization
RAAS: LLM Agentic System Architecture Search with GRPO
Multimodal RewardBench 2: Evaluating Omni Reward Models for Interleaved Text and Image
RPGFusion: 4D Radar Prior-Guided Multi-Modal Fusion for 3D Detection
Underground Plant Exploration: Non-Destructive 3D Root Assessment with GPR Based on Point Graph Neural Network
HSI-GPT2: A Dual-Granularity Large Motion Reasoning Model with Diffusion Refinement for Human–Scene Interaction
Rethinking Pose Refinement in 3D Gaussian Splatting under Pose Prior and Geometric Uncertainty
Dynamics-Aware Preference Optimization for Vision-Language Models
STCDiT: Spatio-Temporally Consistent Diffusion Transformer for High-Quality Video Super-Resolution
EMGauss: Continuous Slice-to-3D Reconstruction via Dynamic Gaussian Modeling in Volume Electron Microscopy
Learning Differentiable Hierarchies in 3D Gaussian Splatting
LRDUN: A Low-Rank Deep Unfolding Network for Efficient Spectral Compressive Imaging
Geometry-as-context: Modulating Explicit 3D in Scene-consistent Video Generation to Geometry Context
Model Merging in the Essential Subspace
Upsample Anything: A Simple and Hard to Beat Baseline for Feature Upsampling
Opening the Sim-to-Real Door for Humanoid Pixel-to-Action Policy Transfer
Region-Wise Correspondence Prediction between Manga Line Art Images
MeanFlow Transformers with Representation Autoencoders
S2D: Selective Spectral Decay for Quantization-Friendly Conditioning of Neural Activations
VCP-Attack: Visual-Contrastive Projection for Transferable Black-Box Targeted Attacks on Large Vision-Language Models
Frequency-Aware Flow Matching for High-Quality Image Generation
Tri-Modal Fusion Transformers for UAV-based Object Detection
R-4B: Incentivizing General-Purpose Auto-Thinking Capability in MLLMs via Bi-Mode Annealing and Reinforce Learning
Dictionary-Aligned Concept Control for Safeguarding Multimodal LLMs
Rethinking Prompt Design for Inference-time Scaling in Text-to-Visual Generation
V-Attack: Targeting Disentangled Value Features for Controllable Adversarial Attacks on LVLMs
EgoX: Egocentric Video Generation from a Single Exocentric Video
CaptionFormer: Unified Segmentation, Tracking, and Captioning for Spatio-Temporal Objects
Thinking with Programming Vision: Towards a Unified View for Thinking with Images
VideoMaMa: Mask-Guided Video Matting via Generative Prior
PhysGaia: A Physics-aware Benchmark with Multi-Body Interactions for Dynamic Novel View Synthesis
I-Scene: 3D Instance Models are Implicit Generalizable Spatial Learners
Structural Action Transformer for 3D Dexterous Manipulation
PositionIC: Unified Position and Identity Consistency for Image Customization
Perceptual-Evidence Anchored Reinforced Learning for Multimodal Reasoning
Pixel2Phys: Distilling Governing Laws from Visual Dynamics
Towards Holistic Modeling for Video Frame Interpolation with Auto-regressive Diffusion Transformers
ActivityForensics: A Comprehensive Benchmark for Localizing Manipulated Activity in Videos
MotionMaster: Generalizable Text-Driven Motion Generation and Editing
PAS: Prelim Attention Score for Detecting Object Hallucinations in Large Vision-Language Models
GS-CLIP: Zero-shot 3D Anomaly Detection by Geometry-Aware Prompt and Synergistic View Representation Learning
Omni-Attack: Adversarial Attacks on Open-Ended VQA in Black-Box Multimodal LLMs
In Pursuit of Pixel Supervision for Visual Pre-training
Lite Any Stereo: Efficient Zero-Shot Stereo Matching
GP-4DGS: Probabilistic 4D Gaussian Splatting from Monocular Video via Variational Gaussian Processes
Glove2Hand: Synthesizing Natural Hand-Object Interaction from Multi-Modal Sensing Gloves
SwitchCraft: Training-Free Multi-Event Video Generation with Attention Controls
UniTEX: Universal High Fidelity Generative Texturing for 3D Shapes
Mobile-VTON: High-Fidelity On-Device Virtual Try-On
Understanding, Accelerating, and Improving MeanFlow Training
HiFICL: High-Fidelity In-Context Learning for Multimodal Tasks
Dehallu3D: Hallucination-Mitigated 3D Generation from a Single Image via Cyclic View Consistency Refinement
Linear Fundamental Matrix Estimation from 7 or 5 Points
Parameterized Prompt for Incremental Object Detection
UFO: Unifying Feed-Forward and Optimization-based Methods for Large Driving Scene Modeling
BrepGaussian: CAD reconstruction from Multi-View Images with Gaussian Splatting
HiF-VLA: Hindsight, Insight and Foresight through Motion Representation for Vision-Language-Action Models
Exact-GS: Mathematically Rigorous and Accurate 3D Gaussian Splatting for 3D X-ray Reconstruction
Your Dissimilarities Define You: Complementary Learning Exploiting Class Diversities
BUSSARD: Normalizing Flows for Bijective Universal Scene-Specific Anomalous Relationship Detection
TUNA: Taming Unified Visual Representations for Native Unified Multimodal Models
Continuous Exposure-Time Modeling for Realistic Atmospheric Turbulence Synthesis
Energy Waveify and Redistribution for Test-Time Adaptation: A Control System Perspective
HAVE-Bench: Hierarchical Audio-Visual Evaluation from Perception to Interaction
Learning and Aligning Click-Aware Shape Prior for Interactive Amodal Instance Segmentation
Improving Motion in Image-to-Video Models via Adaptive Low-Pass Guidance
TM-BSN: Triangular-Masked Blind-Spot Network for Real-World Self-Supervised Image Denoising
Ar2Can: An Architect and an Artist Leveraging a Canvas for Multi-Human Generation
Compositional Transformation Reasoning for Composed Video Retrieval
STCast: Adaptive Boundary Alignment for Global and Regional Weather Forecasting
ShreddingNet: Coarse-to-Fine Restoration for Multi-Source Shredded Manuscripts
OlmoEarth: Stable Latent Image Modeling for Multimodal Earth Observation
RegFormer: Transferable Relational Grounding for Efficient Weakly-Supervised Human-Object Interaction Detection
MotionScale: Reconstructing Appearance, Geometry, and Motion of Dynamic Scenes with Scalable 4D Gaussian Splatting
SwiftVLA: Unlocking Spatiotemporal Dynamics for Lightweight VLA Models at Minimal Overhead
PriVi: Towards a General-Purpose Video Model for Primate Behavior in the Wild
AffordGen: Generating Diverse Demonstrations for Generalizable Object Manipulation with Affordance Correspondence
Nano-EmoX: Unifying Multimodal Emotional Intelligence from Perception to Empathy
Real-Time Multimodal Fingertip Contact Detection via Depth and Motion Fusion for Vision-Based Human–Computer Interaction
AdaptVision: Efficient Vision-Language Models via Adaptive Visual Acquisition
Saliency-Driven Token Merging for Vision Transformers
Weight Space Representation Learning via Neural Field Adaptation
LiDAR-to-4DRadar Diffusion Bridge via Cross-Modal Alignment and Translation in Latent Space
UniM: A Unified Any-to-Any Interleaved Multimodal Benchmark
GaussianZoom: Progressive Zoom-in Generative 3D Gaussian Splatting with Geometric and Semantic Guidance
Synthetic Curriculum Reinforces Compositional Text-to-Image Generation
Edit-As-Act: Goal-Regressive Planning for Open-Vocabulary 3D Indoor Scene Editing
OneSparse: A Unified Framework for Sparse Activation Layers in Vision Models
ID-Crafter: VLM-Grounded Online RL for Compositional Multi-Subject Video Generation
B^3-Seg: Camera-Free, Training-Free 3DGS Segmentation via Analytic EIG and Beta-Bernoulli Bayesian Updates
Towards Uncertainty-aware Unsupervised Domain Adaptation for Videos and Time-Series with Causal Optimal Transport
Turbo-GS: Accelerating 3D Gaussian Fitting for High-Resolution Radiance Fields
GeoSURGE: Geo-localization using Semantic Fusion with Hierarchy of Geographic Embeddings
A Sanity Check for Multi-In-Domain Face Forgery Detection in the Real World
Towards Foundation Models for 3D Scene Understanding: Instance-Aware Self-Supervised Learning for Point Clouds
OccuFly: A 3D Vision Benchmark for Semantic Scene Completion from the Aerial Perspective
MoD-DPO: Towards Mitigating Cross-modal Hallucinations in Omni LLMs using Modality Decoupled Preference Optimization
MotionHiFlow: Text-to-Motion via Hierarchical Flow Matching
WorldLens: Full-Spectrum Evaluations of Driving World Models in Real World
Thinking with Video: Video Generation as a Promising Multimodal Reasoning Paradigm
What Are You Doing? A Closer Look at Controllable Human Video Generation
MR-RAG: Multimodal Relevance-Aware Retrieval-Augmented Generation for Medical Visual Question Answering
Improving Vision-language Models with Perception-centric Process Reward Models
Tracking-Guided 4D Generation: Foundation-Tracker Motion Priors for 3D Model Animation
Let it Snow! Animating 3D Gaussian Scenes with Dynamic Weather Effects via Physics-Guided Score Distillation
Ultra Diffusion Poser: Diffusion-Based Human Motion Tracking from Sparse Inertial Sensors and Ranging-based Between-sensor Distances
Grounded 3D-Aware Spatial Vision-Language Modeling
RobustVisRAG: Causality-Aware Vision-Based Retrieval-Augmented Generation under Visual Degradations
Mark4D: Temporally-Consistent Watermarking for 4D Gaussian Splatting
Unpaired Image Deraining Using Reward-Guided Self-Reinforcement Strategy
OMoBlur: An Object Motion Blur Dataset and Benchmark for Real-World Local Motion Deblurring
The Power of Prior: Training-Free Open-Vocabulary Semantic Segmentation with LLaVA
Post-training Feature Pruning for Fundus Images Classification
VDE: Training-Free Accelerating Rectified Flow Model via Velocity Decomposition and Estimation
LoL: Longer than Longer, Scaling Video Generation to Hour
Bias at the End of the Score
Flow4DGS-SLAM: Optical Flow-Guided 4D Gaussian Splatting SLAM
SWIFT: Sliding Window Reconstruction for Few-Shot Training-Free Generated Video Attribution
Markovian Scale Prediction: A New Era of Visual Autoregressive Generation
Similarity-as-Evidence: Calibrating Overconfident VLMs for Interpretable and Label-Efficient Medical Active Learning
EagleNet: Energy-Aware Fine-Grained Relationship Learning Network for Text-Video Retrieval
IAG: Input-aware Backdoor Attack on VLM-based Visual Grounding
CARLoS: Retrieval via Concise Assessment Representation of LoRAs at Scale
EmbodiedSplat: Online Feed-Forward Semantic 3DGS for Open-Vocabulary 3D Scene Understanding
ReAG: Reasoning-Augmented Generation for Knowledge-based Visual Question Answering
POUR: A Provably Optimal Method for Unlearning Representation via Neural Collapse
Ground Reaction Inertial Poser: Physics-based Human Motion Capture from Sparse IMUs and Insole Pressure Sensors
Monocular Open Vocabulary Occupancy Prediction for Indoor Scenes
Prefill-Time Intervention for Mitigating Hallucination in Large Vision-Language Models
FHAvatar: Fast and High-Fidelity Reconstruction of Face-and-Hair Composable 3D Head Avatar from Few Casual Captures
BiPA: Bilevel Prompt Adaptation for Underwater Instance Segmentation
Homaloidal parametrization for detecting critical two-view configurations
Beyond Success: Refining Elegant Robot Manipulation from Mixed-Quality Data via Just-in-Time Intervention
SpeeDe3DGS: Speedy Deformable 3D Gaussian Splatting with Temporal Pruning and Motion Grouping
Data Leakage Detection and De-duplication in Large Scale Geospatial Image Datasets
VAST: Video Ability-Stratified Taxonomy for Data-Efficient Video Reasoning
TWEO: Transformers Without Extreme Outliers Enables FP8 Training And Quantization For Dummies
D2T2 - Multimodal Automated Planning for Brachytherapy
Semantic Foam: Unifying Spatial and Semantic Scene Decomposition
3D-Aware Implicit Motion Control for View-Adaptive Human Video Generation
SToRe3D: Sparse Token Relevance in ViTs for Efficient Multi-View 3D Object Detection
SOTA: Self-adaptive Optimal Transport for Zero-Shot Classification with Multiple Foundation Models
VisionLeaf: Entropy-Guided Leaf-First Reasoning for Efficient and Accurate Think-with-Image
Generalizable Knowledge Distillation from Vision Foundation Models for Semantic Segmentation
Real2Sim2Real: RetinalDepth-64K for Depth Estimation in Posterior Segment Ophthalmic Surgery
SIR: Structured Image Representations for Explainable Robot Learning
OS-Fed: One Snapshot Is All You Need
Unified Generation and Self-Verification for Vision-Language Models via Advantage Decoupled Preference Optimization
CARD: Correlation Aware Restoration with Diffusion
BackSplit: The Importance of Sub-dividing the Background in Biomedical Lesion Segmentation
Cross-Architecture Adaptation: Cloud-Edge Continual Test-Time Adaptation with Dynamic Sampling and Heterogeneous Distillation
Semantic-Adaptive Diffusion for Dynamic Spatiotemporal Fusion
Efficiently Reconstructing Dynamic Scenes One D4RT at a Time
Opti-NeuS: Neural Reconstruction for Dual-Layered Transparent and Opaque Objects
Hint2Gen: Bridging Understanding and Generation via Code-structured Hints
Focus, Don’t Prune: Identifying Instruction-Relevant Regions for Information-Rich Image Understanding
BHCast: Unlocking Black Hole Plasma Dynamics from a Single Blurry Image with Long-Term Forecasting
The Missing GAP: From Solving Square Jigsaw Puzzles to Handling Real World Archaeological Fragments
OpenFS: Multi-Hand-Capable Fingerspelling Recognition with Implicit Signing-Hand Detection and Frame-Wise Letter-Conditioned Synthesis
COT-FM: Cluster-wise Optimal Transport Flow Matching
OmniSonic: Towards Universal and Holistic Audio Generation from Video and Text
SAR2Net: Learning Spatially Anchored Representations for Retrieval-Guided Cross-Stain Alignment
Dual-Level Hypergraph Generation for Addressing Feature Scarcity in Whole-Slide Image Classification
Generalizable Structure-Aware Keypoint Correspondence for Category-Unified 3D Single Object Tracking
Live Interactive Training for Video Segmentation
Bridging Brain and Semantics: A Hierarchical Framework for Semantically Enhanced fMRI-to-Video Reconstruction
UniRefiner: Teaching Pre-trained ViTs to Self-Dispose Dross via Contrastive Register
MaskAdapt: Learning Flexible Motion Adaptation via Mask-Invariant Prior for Physics-Based Characters
Learning Diffeomorphism for Medical Image Registration with Time-Embedded Architectures Using Semigroup Regularization
One Layer’s Trash is Another Layer’s Treasure: Adaptive Layer-wise Visual Token Selection in LVLMs
Learning Long-term Motion Embeddings for Efficient Kinematics Generation
Bias In, Bias Out? Finding Unbiased Subnetworks in Vanilla Models
MedFG-VQA: Low-Frequency Memory and Graph Attention for Lightweight Medical VQA
Streaming Video Instruction Tuning
Particulate: Feed-Forward 3D Object Articulation
Attack for Defense: Adversarial Agents for Point Prompt Optimization Empowering Segment Anything Model
Widget2Code: From Visual Widgets to UI Code via Multimodal LLMs
Revisiting Learning with Noisy Labels: Active Forgetting and Noise Suppression
ChronoGS: Disentangling Invariants and Changes in Multi-Period Scenes
A Supervised Multi-task Framework for Joint cryo-ET Restoration Enabled by Generative Physical Simulation
Visual Prototype Conditioned Focal Region Generation for UAV-Based Object Detection
AssemblyBench: Physics-Aware Assembly of Complex Industrial Objects
Thermal-Det: Language-Guided Cross-Modal Distillation for Open-Vocabulary Thermal Object Detection
RoSAMDepth: Robust Self-supervised Depth Estimation Leveraging Segment Anything Model
CG-Floor: Centroid-Guided Diffusion for Large-Scale Floorplan Generation
Adapting In-context Generation for Enhanced Composed Image Retrieval
NeoVerse: Enhancing 4D World Model with in-the-wild Monocular Videos
A Difference-in-Difference Approach to Detecting AI-Generated Images
Simple but Effective Triplet-Based Compression Strategies for Compact Visual Localization
TROPHIES: Temporal Reconstruction of Places, Humans, and Cameras from Multi-view Videos
Probabilistic Discrepancy Learning for Roadside LiDAR Scene Completion
D-Convexity: A Unified Differentiable Convex Shape Prior via Quasi-Concavity for Data-driven Image Segmentation
PromptLoop: Plug-and-Play Prompt Refinement via Latent Feedback for Diffusion Model Alignment
CF-IPT: Cross-Modal Fusion Interactive Prompt Tuning of Vision-Language Pre-Trained Model for Multisource Remote Sensing Data Classification
CompetitorFormer: Mitigating Query Conflicts for 3D Instance Segmentation via Competitive Strategy
AT-VLA: Adaptive Tactile Injection for Enhanced Feedback Reaction in Vision-Language-Action Models
SURF: Signature-Retained Fast Video Generation
X-AVDT: Audio-Visual Cross-Attention for Robust Deepfake Detection
Semantic Context Matters: Improving Conditioning for Autoregressive Models
Flat-Pack Bench: Evaluating Spatio-Temporal Understanding in Large Vision-Language Models through Furniture Assembly
StyleGallery: Training-free and Semantic-aware Personalized Style Transfer from Arbitrary Image References
Seeing through Light and Darkness: Sensor-Physics Grounded Deblurring HDR NeRF from Single-Exposure Images and Events
fMRI-LM: Towards a Universal Foundation Model for Language-Aligned fMRI Understanding
SAGE: Training Smart Any-Horizon Agents for Long Video Reasoning with Reinforcement Learning
Image Guides Images: Consistent Video Amodal Completion with Rectified In-Context Exemplar Guidance
Towards Multimodal Domain Generalization with Few Labels
OneHOI: Unifying Human-Object Interaction Generation and Editing
EvoID: Reinforced Evolution for Identity-Preserving Video Generation
Reevaluating the Intra-Modal Misalignment Hypothesis in CLIP
HOG-Layout: Hierarchical 3D Scene Generation, Optimization and Editing via Vision-Language Models
FAAR: Efficient Frequency-Aware Multi-Task Fine-Tuning via Automatic Rank Selection
Batman: Benign Knowledge Alignment Through Malicious Null Space in Federated Backdoor Attack
MatE: Material Extraction from Single-Image via Geometric Prior
Towards Efficient Medical Reasoning with Minimal Fine-Tuning Data
Mechanisms of Object Localization in Vision–Language Models
Select, Hypothesize and Verify: Towards Verified Neuron Concept Interpretation
GenMatter: Perceiving Physical Objects with Generative Matter Models
IFCSR: Inference-Free Fidelity-Realism Control for One-Step Diffusion-based Real-World Image Super-Resolution
The SA-FARI Dataset: Segment Anything in Footage of Animals for Recognition and Identification
Exploring Spatiotemporal Feature Propagation for Video-Level Compressive Spectral Reconstruction: Dataset, Model and Benchmark
4D Local Modeling Toward Dynamic Global Perception for Ambiguity-free Rotation-Invariant Point Cloud Analysis
Mocap-2-to-3: Multi-view Lifting for Monocular Motion Recovery with 2D Pretraining
Adaptive Anisotropic Gaussian Splatting for Multi-contrast MRI Arbitrary-Scale Super-Resolution with Anatomy Guidance
HalluGen: Synthesizing Realistic and Controllable Hallucinations for Evaluating Image Restoration
NeuroSeg Meets DINOv3: Transferring 2D Self-Supervised Visual Priors to 3D Neuron Segmentation via DINOv3 Initialization
LayoutAD: Exploring Semantic-Geometric Misalignment Reasoning for Scene Layout Anomaly Detection
FlexAvatar: Flexible Large Reconstruction Model for Animatable Gaussian Head Avatars with Detailed Deformation
MedMO: Grounding and Understanding Multimodal Large Language Model for Medical Images
GaussianMatch: Semi-Supervised Regression with Pseudo-Label Filtering via Multi-View Gaussian Consistency
Bi-directional Autoregressive Diffusion for Large Complex Motion Interpolation
P2GS: Physical Prior-guided Gaussian Splatting for Photometrically Consistent Urban Reconstruction
UNI-OOD: Unified Object- and Image-level Out-of-Distribution Detection via Cross-Context Attentive Vision-Language Modeling
Air-Know: Arbiter-Calibrated Knowledge-Internalizing Robust Network for Composed Image Retrieval
INSID3: Training-Free In-Context Segmentation with DINOv3
MAPS: Preserving Vision-Language Representations via Module-Wise Proximity Scheduling for Better Vision-Language-Action Generalization
Thinking With Videos: Multimodal Tool-Augmented Reinforcement Learning for Long Video Reasoning
IDESplat: Iterative Depth Probability Estimation for Generalizable 3D Gaussian Splatting
Towards Realistic and Consistent Orbital Video Generation via 3D Foundation Priors
SRA 2: Variational Autoencoder Self-Representation Alignment for Efficient Diffusion Training
Semantic-Guided Global-Local Collaborative Prompt Learning for Few-Shot Class Incremental Learning
FM-Steer: Enhance Generalist Policies with Value-Guided Cascaded Denoising
MOS: Mitigating Optical-SAR Modality Gap for Cross-Modal Ship Re-Identification
SR3R: Rethinking Super-Resolution 3D Reconstruction With Feed-Forward Gaussian Splatting
Diff-SemiER: Transparency-Aware Adaptive Fusion Diffusion Model with Generative Prior for Semi-Transparent Eyeglasses Removal
RoboTAG: End-to-end Robot Pose Estimation via Topological Alignment Graph
Scal3R: Scalable Test-Time Training for Large-Scale 3D Reconstruction
Multimodal Continual Instruction Tuning with Dynamic Gradient Guidance
Head-wise Adaptive Rotary Positional Encoding for Fine-Grained Image Generation
Featurising Pixels from Dynamic 3D Scenes with Linear In-Context Learners
EMO-R3: Reflective Reinforcement Learning for Emotional Reasoning in Multimodal Large Language Models
TTP: Test-Time Padding for Adversarial Detection and Robust Adaptation on Vision-Language Models
SO-Bench: A Structural Output Evaluation of Multimodal LLM
Eulerian Gaussian Splatting using Hashed Probability Pyramids
Real-Time Neural Video Compression with Unified Intra and Inter Coding
COPO: Causal-Oriented Policy Optimization for Hallucinations of MLLMs
Task-Aware Image Signal Processor for Advanced Visual Perception
UniLight: A Unified Representation for Lighting
Learning What to Trust: Bayesian Prior-Guided Optimization for Visual Generation
UniLS: End-to-End Audio-Driven Avatars for Unified Listening and Speaking
Rethinking 2D-3D Registration: A Novel Network for High-Value Zone Selection and Representation Consistency Alignment
Lyapunov Probes for Hallucination Detection in Large Foundation Models
Learning Latent Transmission and Glare Maps for Lens Veiling Glare Removal
MedLIME: A Distribution-Aligned and Evidence-Supported Framework for Medical Saliency Explanations
R^2TUA: Reconstruction-residual Based Targeted and Untargeted Attack Against Text-Image Person Re-Identification
PvP: Data-Efficient Humanoid Robot Learning with Proprioceptive-Privileged Contrastive Representations
SAGE: Style-Adaptive Generalization for Privacy-Constrained Semantic Segmentation Across Domains
Sparsely Timing the Change: A Spiking Temporal Framework for Remote Sensing Interpretation
SemLT3D: Semantic-Guided Expert Distillation for Camera-only Long-Tailed 3D Object Detection
Skyra: AI-Generated Video Detection via Grounded Artifact Reasoning
Rethinking Dataset Distillation: Hard Truths about Soft Labels
EditCtrl: Disentangled Local and Global Control for Real-Time Generative Video Editing
PoseMaster: A Unified 3D Native Framework for Stylized Pose Generation
Rascene: High-Fidelity 3D Scene Imaging with mmWave Communication Signals
GROW: Watermark Generation with Progressive Guidance for Diffusion Models
Transition Matching Distillation for Fast Video Generation
CIGPose: Causal Intervention Graph Neural Network for Whole-Body Pose Estimation
Adaptive Bayesian Early-Exit Networks for Efficient Non-Transferable Learning
SPE-MVS: Spatial Position Encoding Enhanced Multi-View Stereo with Monocular Depth Priors
VL-Eraser: Vacuum Distillation for Machine Unlearning in Vision-Language Models
SDUIE: Semi-Supervised Diffusion for Underwater Image Enhancement with Quant-Text Dual Control
Hybrid Robust Collaborative Perception with LiDAR-4D Radar Fusion under Adverse Weather Conditions
FlashMotion: Few-Step Controllable Video Generation with Trajectory Guidance
Bootstrapping Multi-view Learning for Test-time Noisy Correspondence
GIFSplat: Generative Prior-Guided Iterative Feed-Forward 3D Gaussian Splatting from Sparse Views
EmbodMocap: In-the-Wild 4D Human-Scene Reconstruction for Embodied Agents
See It, Say It, Sorted: An Iterative Training-Free Framework for Visually-Grounded Multimodal Reasoning in LVLMs
Hermite Radial Basis Function for Surface Reconstruction via Differentiable Rendering
3D-VCD: Hallucination Mitigation in 3D-LLM Embodied Agents through Visual Contrastive Decoding
Learning 3D Representations for Spatial Intelligence from Unposed Multi-View Images
CHIPS: Efficient CLIP Adaptation via Curvature-aware Hybrid Influence-based Data Selection
M3DLayout: A Multi-Source Dataset of 3D Indoor Layouts and Structured Descriptions for 3D Generation
Thinking with Drafts: Speculative Temporal Reasoning for Efficient Long Video Understanding
V-DPM: 4D Video Reconstruction with Dynamic Point Maps
Molmo2: Open Weights and Data for Vision-Language Models with Video Understanding and Grounding
Franca: Nested Matryoshka Clustering for Scalable Visual Representation Learning
GroundingME: Exposing the Visual Grounding Gap in MLLMs through Multi-Dimensional Evaluation
Aesthetic Camera Viewpoint Suggestion with 3D Aesthetic Field
AGFT: Alignment-Guided Fine-Tuning for Zero-Shot Adversarial Robustness of Vision-Language Models
Hierarchical Process Reward Models are Symbolic Vision Learners
Unified Spherical Frontend: Learning Rotation-Equivariant Representations of Spherical Images from Any Camera
UniMERNet: A Universal Network for Real-World Mathematical Expression Recognition
GEM-TFL: Bridging Weak and Full Supervision for Forgery Localization through EM-Guided Decomposition and Temporal Refinement
MER-Tracker: Towards High-Speed 3D Point Tracking via Multi-View Event-RGB Hybrid Cameras
RADAR: VQ-VAE Decoder of VAR is a Good Student for Restoring Against Degradation by Acceleration
BiGMINT: Biologically-guided Hierarchical Multimodal Integration for Modeling Multiple Compound Activities in Drug Discovery
Virtual Nodes Guided Dynamic Graph Neural Network for Brain Tumor Segmentation with Missing Modalities
Selective, Regularized, and Calibrated: Harnessing Vision Foundation Models for Cross-Domain Few-Shot Semantic Segmentation
Annotation-Efficient Coreset Selection for Context-dependent Segmentation
Self-Consistency for LLM-Based Motion Trajectory Generation and Verification
GR-Gauge: Cost-efficient Training Configuration By Gauging the Gradient Redundancy
Neural Gabor Splatting: Enhanced Gaussian Splatting with Neural Gabor for High-frequency Surface Reconstruction
Hierarchically Robust Zero-shot Vision-language Models
ESAM++: Efficient Online 3D Perception on the Edge
Think Before You Drive: World Model-Inspired Multimodal Grounding
When Numbers Speak: Aligning Textual Numerals and Visual Instances in Text-to-Video Diffusion Models
Text-Image Conditioned 3D Generation
PhysX-Anything: Simulation-Ready Physical 3D Assets from Single Image
S^2-MLLM: Boosting Spatial Reasoning Capability of MLLMs for 3D Visual Grounding with Structural Guidance
AudioStory: Generating Long-Form Narrative Audio with Large Language Models
PhysGM: Large Physical Gaussian Model for Feed-Forward 4D Synthesis
CubeComposer: Spatio-Temporal Autoregressive 4K 360° Video Generation from Perspective Video
Scene Reconstruction as Mapping Priors for 3D Detection
Color When It Counts: Grayscale-Guided Online Triggering for Always-On Streaming Video Sensing
TR2M: Transferring Monocular Relative Depth to Metric Depth with Language Descriptions and Dual-Level Scale-Oriented Contrast
Globscope: Toward a Global View of the Loss Landscape
Deeper Thought, Weaker Aim: Understanding and Mitigating Perceptual Impairment during Reasoning in Multimodal Large Language Models
MASQuant: Modality-Aware Smoothing Quantization for Multimodal Large Language Models
Beyond Layer-Wise Merging: Chain-of-Merging for Vision-Language Models
AlignPose: Generalizable 6D Pose Estimation via Multi-view Feature-metric Alignment
Graph-to-Frame RAG: Visual-Space Knowledge Fusion for Training-Free and Auditable Video Reasoning
Synthetic Object Compositions for Scalable and Accurate Learning in Detection, Segmentation, and Grounding
Self-Attention Driven Tensor Representation for High-Order Data Recovery
ReFAct: Empowering Multimodal Web Agents with Visual and Context Focusing
FedSDR: Federated Graph Learning with Structural Noise Detection and Reconstruction
GeoAgent: Learning to Geolocate Everywhere with Reinforced Geographic Characteristics
NTK-Guided Implicit Neural Teaching
CodeMMR: Bridging Natural Language, Code, and Image for Unified Retrieval
ActiveVLA: Injecting Active Perception into Vision-Language-Action Models for Precise 3D Robotic Manipulation
A Combination of Noise and Bilateral Filters Achieve Supralinear and Scalable Adversarial Robustness in CNNs
BriMA: Bridged Modality Adaptation for Multi-Modal Continual Action Quality Assessment
Distilling Quasi-Conformal Mapping: A Generalizable and Efficient Solution for Wide-Angle Correction
Lighting-grounded Video Generation with Renderer-based Agent Reasoning
ParTY: Part-Guidance for Expressive Text-to-Motion Synthesis
GeoCoT: Towards Reliable Remote Sensing Reasoning with Manifold Perspective
Grounded Latents for Entity-Centric 4D Scene Generation
Deformation-based In-Context Learning for Point Cloud Understanding
Data-Centric Meta-Learning for Robust Few-Shot Generalization
Ultra-Fast Neural Video Compression
UNICBench: UNIfied Counting Benchmark for MLLM
CRAFT: Aligning Diffusion Models with Fine-Tuning Is Easier Than You Think
M⁴-SAM: Multi-Modal Mixture-of-Experts with Memory-Augmented SAM for RGB-D Video Salient Object Detection
Inter-Photon-Limited Videography
MHopReg: Efficient Hierarchical Multi-Hop Graph Search for Point Cloud Registration
Subspace Alignment for CLIP-based Continual Learning via Canonical Correlation Analysis
Inferring Compositional 4D Scenes without Ever Seeing One
CGL: Advancing Continual GUI Learning via Reinforcement Fine-Tuning
On Token's Dilemma: Dynamic MoE with Drift-Aware Token Assignment for Continual Learning of Large Vision Language Models
TANGO: Text-Anchored Guided Optimization for Robust Fine-tuning Vision-Language Models under Label Noise
SpaceMind: Camera-Guided Modality Fusion for Spatial Reasoning in Vision-Language Models
3D Space as a Scratchpad for Editable Text-to-Image Generation
UARE: A Unified Vision-Language Model for Image Quality Assessment, Restoration, and Enhancement
DrivePI: Spatial-aware 4D MLLM for Unified Autonomous Driving Understanding, Perception, Prediction and Planning
VES-RFT: Rewarding Visual Evidence Sensitivity to Mitigate Hallucinations in Large Vision–Language Models
MaskDiME: Adaptive Masked Diffusion for Precise and Efficient Visual Counterfactual Explanations
SRPO: Self-Referential Policy Optimization for Vision-Language-Action Models
Differences That Matter: Auditing Models for Capability Gap Discovery and Rectification
Fusion of Depth and Semantics for Probabilistic Floorplan Localization
CROWn: A Unified Framework for Anti-Aliased Downsampling and Phase-Calibrated Fusion in 3D Medical Segmentation
EpiAgent: An Agent-Centric System for Ancient Inscription Restoration
MapReduce LoRA: Advancing the Pareto Front in Multi-Preference Optimization for Generative Models
3D-Object Perception Transformer (3PT)
OpenMarcie: Dataset for Multimodal Action Recognition in Industrial Environments
InterPrior: Scaling Generative Control for Physics-Based Human-Object Interactions
SmokeSVD: Smoke Reconstruction from A Single View via Progressive Novel View Synthesis and Refinement with Diffusion Models
FastGaMer: Efficient GainMap Learning for Practical Inverse Tone Mapping
RecTok: Reconstruction Distillation along Rectified Flow
Avatar Forcing: Real-Time Interactive Head Avatar Generation for Natural Conversation
Neural Mixture Density Processes
Time Without Time: Pseudo-Temporal Representation for Space-Time Super-Resolution
FunFact: Building Probabilistic Functional 3D Scene Graphs via Factor-Graph Reasoning
Event Stream Filtering via Probability Flux Estimation
OpenDPR: Open-Vocabulary Change Detection via Vision-Centric Diffusion-Guided Prototype Retrieval for Remote Sensing Imagery
No Way To Steal My Face: Proactive Defense Against Identity-Preserving Personalized Generation
Temporal Representation Enhancement (TRE): Learning to Forget Dominant Patterns for Enhanced Temporal Spiking Features
Fixed Anchors Are Not Enough: Dynamic Retrieval and Persistent Homology for Dataset Distillation
Virtual Immunohistochemistry Staining with Dual-Aligned Multi-Task Feature Guidance
PAVAS: Physics-Aware Video-to-Audio Synthesis
MAD: Motion Appearance Decoupling for efficient Driving World Models
WalkGPT: Grounded Vision–Language Conversation with Depth-Aware Segmentation for Pedestrian Navigation
SegCompass: Exploring Interpretable Alignment with Sparse Autoencoders for Enhanced Reasoning Segmentation
AdapAction: Adaptive Target Action Backdoor Attack against GUI Agents
Exposing and Evaluating Hallucinations for GUI Grounding
Pose-guided Enriched Feature Learning for Federated-by-camera Person Re-identification
VideoWorld 2: Learning Transferable Knowledge from Real-world Videos
VideoFusion: A Spatio-Temporal Collaborative Network for Multi-modal Video Fusion
GeoSemba: Reconstructing State Space Model for Cross Paradigm Representation in Medical Image Segmentation
Intrinsic Geometry-Appearance Consistency Optimization for Sparse-View Gaussian Splatting
Retrieve and Segment: Are a Few Examples Enough to Bridge the Supervision Gap in Open-Vocabulary Segmentation?
EgoAVU: Egocentric Audio-Visual Understanding
CASPA: Graph-Structured Concept Anchors for Modality-Agnostic Adaptation in Vision–Language Models
OLATverse: A Large-scale Real-world Object Dataset with Precise Lighting Control
MuM: Multi-View Masked Image Modeling for 3D Vision
MMVIP: A Visible-infrared Paired Dataset for Multi-weather Marine Vision
BiEvLight: Bi-level Learning of Task-Aware Event Refinement for Low-Light Image Enhancement
Towards Visual Query Localization in the 3D World
LightSplat: Fast and Memory-Efficient Open-Vocabulary 3D Scene Understanding in Five Seconds
R-C2: Cycle-Consistent Reinforcement Learning Improves Multimodal Reasoning
TopoHR: Hierarchical Centerline Representation for Cyclic Topology Reasoning in Driving Scenes with Point-to-Instance Relations
VidTAG: Temporally Aligned Video to GPS Geolocalization with Denoising Sequence Prediction at a Global Scale
VidEoMT: Your ViT is Secretly Also a Video Segmentation Model
Towards Reasoning-Preserving Unlearning in Multimodal Large Language Models
DemoFunGrasp: Universal Dexterous Functional Grasping via Demonstration-Editing Reinforcement Learning
SpaceTimePilot: Generative Rendering of Dynamic Scenes Across Space and Time
MARCO: Navigating the Unseen Space of Semantic Correspondence
Residual Connections Harm Generative Representation Learning
Scaling Up AI-Generated Image Detection with Generator-Aware Prototypes
CineBrain: A Large-Scale Multi-Modal Audiovisual Brain Dataset for Brain-Conditioned Video Generation
Towards Cross-Modal Preservation, Consistency and Alignment for Privacy-Preserving Visible-Infrared Person Re-Identification
Realiz3D: 3D Generation Made Photorealistic via Domain-Aware Learning
SegEarth-R2: Towards Comprehensive Language-guided Segmentation for Remote Sensing Images
RAISE: Requirement-Adaptive Evolutionary Refinement for Training-Free Text-to-Image Alignment
Scaling Parallel Sequence Models to Vision Foundation Models
More than the Sum: Panorama-Language Models for Adverse Omni-Scenes
Probing and Bridging Geometry–Interaction Cues for Affordance Reasoning in Vision Foundation Models
Differentiable Vector Quantization for Rate-Distortion Optimization of Generative Image Compression
IPR-1: Interactive Physical Reasoner
Out of Sight, Out of Track: Adversarial Attacks on Propagation-based Multi-Object Trackers via Query State Manipulation
Spatial-Spectral Residuals Informed Diffusion Neural Operator for Pan-sharpening
ReDirector: Creating Any-Length Video Retakes with Rotary Camera Encoding
ConeSep: Cone-based Robust Noise-Unlearning Compositional Network for Composed Image Retrieval
Towards GUI Agents: Vision-Language Diffusion Models for GUI Grounding
VMD-FACT: A New Video Dataset and MLLM-based method for Detecting Realistic AI-Generated Video Misinformation
Vista4D: Video Reshooting with 4D Point Clouds
VLIC: Vision-Language Models As Perceptual Judges for Human-Aligned Image Compression
Source Models Leak What They Shouldn’t: Unlearning Zero-Shot Transfer in Domain Adaptation Through Adversarial Optimization
Hyperbolic Relational Prompts for Intersectional Fairness in Medical VLMs
TopoMesh: High-Fidelity Mesh Autoencoding via Topological Unification
TALO: Pushing 3D Vision Foundation Models Towards Globally Consistent Online Reconstruction
OralGPT-Plus: Learning to Use Visual Tools via Reinforcement Learning for Panoramic X-ray Analysis
UnicEdit-10M: A Dataset and Benchmark Breaking the Scale-Quality Barrier via Unified Verification for Reasoning-Enriched Edits
LS-ViT: Least-Squares Hessian Based Block Reconstruction for Low-Bit Post-Training Quantization of Vision Transformers
CURE: Curriculum-guided Multi-task Training for Reliable Anatomy Grounded Report Generation
VGG-T^3: Offline Feed-Forward 3D Reconstruction at Scale
CLaD: Planning with Grounded Foresight via Cross-Modal Latent Dynamics
Unblur-SLAM: Dense Neural SLAM for Blurry Inputs
POCA: Pareto-Optimal Curriculum Alignment for Visual Text Generation
Charge: A Comprehensive Novel View Synthesis Benchmark and Dataset to Bind Them All
Seeing Beyond 8bits: Subjective and Objective Quality Assessment of HDR-UGC Videos
Direction-aware 3D Large Multimodal Models
Diffusion-Based sRGB Real Noise Generation via Prompt-Driven Noise Representation Learning
Hierarchical Action Learning for Weakly-Supervised Action Segmentation
SketchAssist: A Practical Assistant for Semantic Edits and Precise Local Redrawing
MuCo: Multi-turn Contrastive Learning for Multimodal Embedding Model
Diffusion MRI Transformer with a Diffusion Space Rotary Positional Embedding (D-RoPE)
DreamShot: Personalized Storyboard Synthesis with Video Diffusion Prior
Partial Weakly-Supervised Oriented Object Detection
Time Blindness: Why Video-Language Models Can’t See What Humans Can?
When Robots Should Say “I Don’t Know”: Benchmarking Abstention in Embodied Question Answering
Multi-view Crowd Tracking Transformer with View-Ground Interactions Under Large Real-World Scenes
Decoding 3D Perception via BrainSSD: Synergistic Fusion of EEG Representations from Static and Dynamic Visual Streams
cryoSENSE: Compressive Sensing Enables High-throughput Microscopy with Sparse and Generative Priors on the Protein Cryo-EM Image Manifold
PowerCLIP: Powerset Alignment for Contrastive Pre-Training
From Indoor to Open World: Revealing the Spatial Reasoning Gap in MLLMs
Monet: Reasoning in Latent Visual Space Beyond Image and Language
GLINT: Modeling Scene-Scale Transparency via Gaussian Radiance Transport
Expanding mmWave Datasets for Human Pose Estimation with Unlabeled Data and LiDAR Datasets
Taming Sampling Perturbations with Variance Expansion Loss for Latent Diffusion Models
Spectral Super-Resolution via Adversarial Unfolding and Data-Driven Spectrum Regularization: From Multispectral Satellite Data to NASA Hyperspectral Image
Training-free Mixed-Resolution Latent Upsampling for Spatially Accelerated Diffusion Transformers
Multi-Metric Representation Learning Strategy Based on Clustering for Fine-Grained Multimodal Sentiment Analysis
ConceptPrism: Concept Disentanglement in Personalized Diffusion Models via Residual Token Optimization
Event6D: Event-based Novel Object 6D Pose Tracking
SAVE: Speech-Aware Video Representation Learning for Video-Text Retrieval
Stepwise Credit Assignment for GRPO on Flow-Matching Models
MatLat: Material Latent Space for PBR Texture Generation
x^2-Fusion: Cross-Modality and Cross-Dimension Flow Estimation in Event Edge Space
Mind the Discriminability Trap in Source-Free Cross-domain Few-shot Learning
Image-to-Point Cloud Feature Back-Projection for Multimodal Training of 3D Semantic Segmentation
SparseWorld-TC: Trajectory-Conditioned Sparse Occupancy World Model
TRCoRSurg: Temporal-Relational Co-Reasoning for Surgical Video Triplet Recognition
Volumetric Functional Maps
Don’t Show Pixels, Show Cues: Unlocking Visual Tool Reasoning in Language Models via Perception Programs
Incremental Object Detection via Future-Aware Decoupled Cross-Head Distillation
Bootstrap Dynamic-Aware 3D Visual Representation for Scalable Robot Learning
DAGE: Dual-Stream Architecture for Efficient and Fine-Grained Geometry Estimation
SEA-Flow3D: Simplified, Efficient, and Accurate Scene Flow via Spatial Vector Sampling and Multi-scale Refinement
RealBirdID: Benchmarking Bird Species Identification in the Era of MLLMs
Generating Humanless Environment Walkthroughs from Egocentric Walking Tour Videos
InstantRetouch: Efficient and High-Fidelity Instruction-Guided Image Retouching with Bilateral Space
Bi-Bridge: Bidirectional Diffusion Bridges for Low-Light Image Enhancement
Better, Stronger, Faster: Tackling the Trilemma in MLLM-based Segmentation with Simultaneous Textual Mask Prediction
StaMo: Unsupervised Learning of Generalizable Robot Motion from Compact State Representation
DiffusionHarmonizer: Bridging Neural Reconstruction and Photorealistic Simulation with Online Diffusion Enhancer
MSPT: Efficient Large-Scale Physical Modeling via Parallelized Multi-Scale Attention
SVHalluc: Benchmarking Speech–Vision Hallucination in Audio-Visual Large Language Models
Convexity-Aware Noise Calibration: A Self-Supervised Framework for Noise-Level-Unknown Image Denoising
Mind the Way You Select Negative Texts: Pursuing the Distance Consistency in OOD Detection with VLMs
ReSAM: Refine, Requery, and Reinforce: Self-Prompting Point-Supervised Segmentation for Remote Sensing Images
Illuminating Visual Identity in Universal Multimodal Embeddings
VS-Bench: Evaluating VLMs for Strategic Abilities in Multi-Agent Environments
Event-Based Motion Deblurring Using Task-Oriented 3D Gaussian Event Representations
Uncertainty-driven 3D Gaussian Splatting Active Mapping via Anisotropic Visibility Field
Large-scale Codec Avatars: The Unreasonable Effectiveness of Large-scale Avatar Pretraining
One-Step Diffusion Transformer for Controllable Real-World Image Super-Resolution
Proxy-Tuning: Tailoring Multimodal Autoregressive Models for Subject-Driven Image Generation
Reconstructing Spiking Neural Networks Using a Single Neuron with Autapses
HVG-3D: Bridging Real and Simulation Domains for 3D-Conditional Hand-Object Interaction Video Synthesis
Qwen-Image-Layered: Towards Inherent Editability via Layer Decomposition
VDOT: Efficient Unified Video Creation via Optimal Transport Distillation
Bridging Facial Understanding and Animation via Language Models
IsoCLIP: Decomposing CLIP Projectors for Efficient Intra-modal Alignment
VinQA: Visual Elements Interleaved Long-form Answer Generation for Real-World Multimodal Document QA
SARL-STG: A Spatially Aware Reinforcement Learning Framework for Refining MLLMs in Spatio-Temporal Video Grounding
RISE: Single Static Radar-based Indoor Scene Understanding
RAG-TP: A General Framework for Vehicle Trajectory Prediction via Retrieval-Augmented Generation
SkyReels-Text: Fine-Grained Font-Controllable Text Editing for Poster Design
SIGMA: A Physics-Based Benchmark for Gas Chimney Understanding in Seismic Images
TeFlow: Enabling Multi-frame Supervision for Self-Supervised Feed-forward Scene Flow Estimation
OmniFood8K: Single-Image Nutrition Estimation via Hierarchical Frequency-Aligned Fusion
Visual Diffusion Models are Geometric Solvers
BuildingGPT: Auto-Regressive Building Wireframe Reconstruction Model with Reinforcement Learning
SketchFaceGS: Real-Time Sketch-Driven Face Editing and Generation with Gaussian Splatting
Unlocking the Power of Critical Factors for 3D Visual Geometry Estimation
Spatio-Temporal Conditional Denoising Transformer for Modality-Missing RGBT Tracking
ZOO-Prune: Training-Free Token Pruning via Zeroth-Order Gradient Estimation in Vision-Language Models
Learning a Unified Latent Action Space from Videos with Action-centric Cycle Consistency
WorldMM: Dynamic Multimodal Memory Agent for Long Video Reasoning
Grid Distillation: Compositional Image Distillation via Structured Generative Grids
Matching Every Pair to Track Every Point: PairFormer for All-Pairs Tracking and Video Trajectory Fields
Sparsity-Aware Voxel Attention and Foreground Modulation for 3D Semantic Scene Completion
Exploring Conditions for Diffusion Models in Robotic Control
Skullptor: High Fidelity 3D Head Reconstruction in Seconds with Multi-View Normal Prediction
Pose-Free Omnidirectional Gaussian Splatting for 360-Degree Videos with Consistent Depth Priors
Streamlined Open-Vocabulary Human-Object Interaction Detection
LESA: Learnable Stage-Aware Predictors for Diffusion Model Acceleration
ObjectMorpher: 3D-Aware Image Editing via Deformable 3DGS
SANER: Switchable Adapter with Non-parametric Enhanced Routing for Person De-Reidentification
Latent Diffusion Inversion Requires Understanding the Latent Space
GS-ASM: 2DGS-Supervised Active Stereo Matching
AVA-VLA: Improving Vision-Language-Action models with Active Visual Attention
SeaCache: Spectral-Evolution-Aware Cache for Accelerating Diffusion Models
Do You Have Freestyle? Expressive Humanoid Locomotion via Audio Control
SafeRoPE: Risk-specific Head-wise Embedding Rotation for Safe Generation in Rectified Flow Transformers
Interpretable and Steerable Concept Bottleneck Sparse Autoencoders
EV-CGNet: Co-visible Focused 3D-guided 2D Event Keypoint Detection Network
Token Warping Helps MLLMs Look from Nearby Viewpoints
RawMetaDiff: Unlocking Extreme Darkness from Dual-Exposure RAW with Meta-Guided Diffusion
Premier: Personalized Preference Modulation with Learnable User Embedding in Text-to-Image Generation
Elastic3D: Controllable Stereo Video Conversion with Guided Latent Decoding
gQIR: Generative Quanta Image Reconstruction
RDF-MIG: A Robust Diffusion Framework for Masked Image Generation to Augment Semantic Segmentation and Change Detection
Denoising as Path Planning: Training-Free Acceleration of Diffusion Models with DPCache
Learning Mutual View Information Graph for Adaptive Adversarial Collaborative Perception
MVLM: Template-Free Tracking via Vision–Language Margin Confidence and Memory-Gated Tracking
InTrain: Intrinsic Trainability for Zero-Cost Neural Architecture Search
Contrastive Cross-Bag Augmentation for Multiple Instance Learning-based Whole Slide Image Classification
Discover, Segment, and Select: A Progressive Mechanism for Zero-shot Camouflaged Object Segmentation
Keep It Frozen: Domain-Routed Conditional Residual Modulation for Multi-Domain Vision Transformers
ActAvatar: Temporally-Aware Precise Action Control for Talking Avatars
MAGICIAN: Efficient Long-Term Planning with Imagined Gaussians for Active Mapping
Circuit Mechanisms for Spatial Relation Generation in Diffusion Transformers
UniGeoRS: A Unified Benchmark for Tri-view Geo-Localization
Fast3Dcache: Training-free 3D Geometry Synthesis Acceleration
Time-Aware One Step Diffusion Network for Real-World Image Super-Resolution
DMAligner: Enhancing Image Alignment via Diffusion Model Based View Synthesis
Dexterous World Models
Fast Markov Random Field Optimisation for Topologically Noisy 3D Shape Matching
Stand-In: A Lightweight and Plug-and-Play Identity Control for Video Generation
ReflexSplit: Single Image Reflection Separation via Layer Fusion-Separation
Medic-AD: Towards Medical Vision-Language Model's Clinical Intelligence
VENI: Variational Encoder for Natural Illumination
Image Diffusion Preview with Consistency Solver
MorphSeek: Fine-grained Latent Representation-Level Policy Optimization for Deformable Image Registration
UniGenDet: A Unified Generative-Discriminative Framework for Co-Evolutionary Image Generation and Generated Image Detection
SABER: Spatially Consistent 3D Universal Adversarial Objects for BEV Detectors
LightMover: Generative Light Movement with Color and Intensity Controls
Ego-1K – A Large-Scale Multiview Video Dataset for Egocentric Vision
SigLino: Efficient Multi-Teacher Distillation for Agglomerative Vision Foundation Models
Diffusion Sampling Path Tells More: An Efficient Plug-and-Play Strategy for Sample Filtering
GUI-SAGE: Enhancing GUI Automation with Self-Explanatory Learning
Too Vivid to Be Real? Benchmarking and Calibrating Generative Color Fidelity
UniFusion: A Unified Image Fusion Framework with Robust Representation and Source-Aware Preservation
CHIRP dataset: towards long-term, individual-level, behavioral monitoring of bird populations in the wild
Demo2Tutorial: From Human Experience to Multimodal Software Tutorials
GazeShift: Unsupervised Gaze Estimation and Dataset for VR
Layered 4D-Rotor Gaussian Splatting: A Compressed Representation for Long Dynamic Scenes
First Logit Boosting: Visual Grounding Method to Mitigate Object Hallucination in Large Vision-Language Models
OmniGround: A Comprehensive Spatio-Temporal Grounding Benchmark for Real-World Complex Scenarios
Training-Only Heterogeneous Image-Patch-Text Graph Supervision for Advancing Few-Shot Learning Adapters
ROSE: Rotate Your Large Language Model to See
Seeing is Improving: Visual Feedback for Iterative Text Layout Refinement
VQRAE: Representation Quantization Autoencoders for Multimodal Understanding, Generation and Reconstruction
DBMSolver: A Training-free Diffusion Bridge Sampler for High-Quality Image-to-Image Translation
Interactive Tracking: A Human-in-the-Loop Paradigm with Memory-Augmented Adaptation
VT-Intrinsic: Physics-Based Decomposition of Reflectance and Shading using a Single Visible-Thermal Image Pair
FoleyDesigner: Immersive Stereo Foley Generation with Precise Spatio-Temporal Alignment for Film Clips
Advancing Cancer Prognosis with Hierarchical Fusion of Genomic, Proteomic and Pathology Imaging Data from a Systems Biology Perspective
BioVITA: Biological Dataset, Model, and Benchmark for Visual-Textual-Acoustic Alignment
HFedATM: Hierarchical Federated Domain Generalization via Optimal Transport and Regularized Mean Aggregation
VLM-3R: Vision-Language Models Augmented with Instruction-Aligned 3D Reconstruction
Pixel Motion Diffusion is What We Need for Robot Control
LEAD: Minimizing Learner-Expert Asymmetry in End-to-End Driving
RAW-Domain Degradation Models for Realistic Smartphone Super-Resolution
Scaling the Long Video Understanding of Multimodal Large Language Models via Visual Memory Mechanism
UDAPose: Unsupervised Domain Adaptation for Low-Light Human Pose Estimation
Hyperbolic Prototype Learning with Uncertainty-Aware Consistency for Continual Test-Time Segmentation
SAGE: Scalable Agentic 3D Scene Generation for Embodied AI
WHU-MARS: A Multispectral Aerial-Ground Benchmark Towards Any-Scenario Person Re-Identification
Can You Learn to See Without Images? Procedural Warm-Up for Vision Transformers
VLM-Pruner: Buffering for Spatial Sparsity in an Efficient VLM Centrifugal Token Pruning Paradigm
DIMOS: Disentangling Instance-level Moving Object Segmentation
PrITTI: Primitive-based Generation of Controllable and Editable 3D Semantic Urban Scenes
DreamSAC: Learning Hamiltonian World Models via Symmetry Exploration
ShapeR: Robust Conditional 3D Shape Generation from Casual Captures
Boosting Vision-Language Models Towards Cross-Domain Incremental Object Detection
BiGain: Unified Token Compression for Joint Generation and Classification
Part-aware Modeling of Articulated Objects using 3D Gaussian Splatting
Say Cheese! Detail-Preserving Portrait Collection Generation via Natural Language Edits
Human Interaction-Aware 3D Reconstruction from a Single Image
Resolving Evidence Sparsity: Agentic Context Engineering for Long-Document Understanding
SpaceDrive: Infusing Spatial Awareness into VLM-based Autonomous Driving
MOSAIC-GS: Monocular Scene Reconstruction via Advanced Initialization for Complex Dynamic Environments
ORBIT: Benchmarking SfM in the Wild with 360° Video
TriSim: Tri-Dimensional Similarity Modeling with Extreme Value Theory for False-Negative Mitigation in Remote Sensing Image-Text Retrieval
Hoi! - A Multimodal Dataset for Force-Grounded, Cross-View Articulated Manipulation
Causal Motion Diffusion Models for Autoregressive Motion Generation
Understanding and Enforcing Weight Disentanglement in Task Arithmetic
ORD: Object-Relation Decoupling for Generalized 3D Visual Grounding
FACE: A Face-based Autoregressive Representation for High-Fidelity and Efficient Mesh Generation
HAD: Hallucination-Aware Diffusion Priors for 3D Reconstruction
Degradation-Consistent Test-Time Adaptation for All-in-One Image Restoration
Align While Search: Belief-Guided Exploratory Inference for World-Grounded Embodied Agents
UniDef: Universal Defense Against Unauthorized Image Manipulation
FAST: Topology-Aware Frequency-Domain Distribution Matching for Coreset Selection
rPPG-VQA: A Video Quality Assessment Framework for Unsupervised rPPG Training
TUDSR: Twice Upsampling-Diffusion for Higher Super-Resolution
eRetinexGS: Retinex Modeling for Low-Light Scene Enhancement via Event Streams and 3D Gaussian Splatting
WebGym: Scaling Training Environments for Long-Horizon Visual Web Agents with Realistic Tasks
InstantViR: Real-Time Video Inverse Problem Solver with Distilled Diffusion Prior
Designing to Forget: Deep Semi-parametric Models for Unlearning
PhotoFramer: Multi-modal Image Composition Instruction
Multi-view Consistent 3D Gaussian Head Avatars 'without' Multi-view Generation
OmniLottie: Generating Vector Animations via Parameterized Lottie Tokens
LOREAL: Mitigating Low-Resolution Challenges in Vision-Language Models with Attribute-driven Prompt Self-Distillation
Narrative Weaver: Towards Controllable Long-Range Visual Consistency with Multi-Modal Conditioning
TrajTok: Learning Trajectory Tokens Enhances Video Understanding
StreamingTOM: Streaming Token Compression for Efficient Video Understanding
Cov2Pose: Leveraging Spatial Covariance for Direct Manifold-aware 6-DoF Object Pose Estimation
MoRE: 3D Visual Geometry Reconstruction Meets Mixture-of-Experts
Generative Point Tracking and Forecasting
HyperGait: Unleashing the Power of Parsing for Gait Recognition in the Wild via Hypergraph
LLaMo: Scaling Pretrained Language Models for Unified Motion Understanding and Generation with Continuous Autoregressive Tokens
Pico-Banana-400K: A Large-Scale Dataset for Text-Guided Image Editing
Stable and Efficient Single-Rollout RL for Multimodal Reasoning
Chain-of-Models Pre-Training: Rethinking Training Acceleration of Vision Foundation Models
Cubic Discrete Diffusion: Discrete Visual Generation on High-Dimensional Representation Tokens
BlackMirror: Black-Box Backdoor Detection for Text-to-Image Models via Instruction-Response Deviation
V-RGBX: Video Editing with Accurate Controls over Intrinsic Properties
Gyro-based Deep Video Deblurring
Learning Eigenstructures of Unstructured Data Manifolds
OntoAug: Rethinking Generative Data Augmentation via Ontology Guidance
Revisiting Optimal Coding for I-ToF under Practical Sensor Constraints
PALM: Progress-Aware Policy Learning via Affordance Reasoning for Long-Horizon Robotic Manipulation
HiFi-Inpaint: Towards High-Fidelity Reference-Based Inpainting for Generating Detail-Preserving Human-Product Images
VecGlypher: Unified Vector Glyph Generation with Language Models
DPAR: Dynamic Patchification for Efficient Autoregressive Visual Generation
IAFMNet: Information-Aware Feature Modulation for Efficient Super-Resolution
ViLearn: Accelerating Training Convergence of Image-to-3D Generation via Visibility Learning
PrivSynth: Alternating and Control-Based Optimization for Privacy and Utility in Synthetic Data
FreeScale: Scaling 3D Scenes via Certainty-Aware Free-View Generation
EMMA: Concept Erasure Benchmark with Comprehensive Semantic Metrics and Diverse Categories
FlexiVideo: Variation-Aware Temporal Dynamics Modeling for Efficient Video Understanding
SAQN: Semantic-based Adaptive Query Network for 3D Referring Expression Segmentation
WOD-E2E: Waymo Open Dataset for End-to-End Driving in Challenging Long-tail Scenarios
CARE-Edit: Condition-Aware Routing of Experts for Contextual Image Editing
Seeing Through the Shift: Causality-Inspired Robust Generalized Category Discovery
PGA: Prior-free Generative Attack for Practical No-box Scenario
RewardFlow: Generate Images by Optimizing What You Reward
HCL-FF: Hierarchical and Contrastive Learning for Forward-Forward Algorithm
ViHOI: Human-Object Interaction Synthesis with Visual Priors
Unlocking Motion from Large Vision Models with a Semantic and Kinematic Duality for Gait Recognition
PARSE: Part-Aware Relational Spatial Modeling
Generative Diffusion Priors for 3D Mapping of the Dark Universe
SwiftTailor: Efficient 3D Garment Generation with Geometry Image Representation
GeoBridge: A Semantic-Anchored Multi-View Foundation Model Bridging Images and Text for Geo-Localization
UniPercept: A Unified Diffusion Model for Generalizable Visual Perception
Adapting a Pre-trained Single-Cell Foundation Model to Spatial Gene Expression Generation from Histology Images
GGPT: Geometry-Grounded Point Transformer
Probabilistic Precipitation Nowcasting with Rectified Flow Transformers
Exemplar-Free Continual Learning for State Space Models
Eliciting Complex Spatial Reasoning in MLLMs through Wide-Baseline Matching
BrepVGAE: Variational Graph Autoencoder with Unified Latent Representation for B-rep
Multi-speaker Attention Alignment for Multimodal Social Interaction
OmniDocLayout: Towards Diverse Document Layout Generation via Coarse-to-Fine LLM Learning
Wavelet-based Frame Selection by Detecting Semantic Boundary for Long Video Understanding
HanDyVQA: A Video QA Benchmark for Fine-Grained Hand-Object Interaction Dynamics
DynFusion: Rethinking Condition Fusion for Adaptive Multi-Conditional Text-to-Image Generation
Composite-Attribute Person Re-Identification via Pose-Guided Disentanglement
YOLO-ULM: Ultra-Lightweight Models for Real-Time Object Detection
ProcessMaker: A Generalized Process Visualization Framework with Adaptive Sequence Steps on Diffusion Transformers
Anchoring the Mind of Multimodal Reasoners: Cognitive Bias as a Vector for Jailbreak Attacks
CD-Buffer: Complementary Dual-Buffer Framework for Test-Time Adaptation in Adverse Weather Object Detection
Med-CMR: A Fine-Grained Benchmark Integrating Visual Evidence and Clinical Logic for Medical Complex Multimodal Reasoning
Align Images Before You Generate
Scaling Agentic Reinforcement Learning for Tool-Integrated Reasoning in VLMs
Bridge: Basis-Driven Causal Inference Marries VFMs for Domain Generalization
Recover to Predict: Progressive Retrospective Learning for Variable-Length Trajectory Prediction
Refer-Agent: A Collaborative Multi-Agent System with Reasoning and Reflection for Referring Video Object Segmentation
SpikeTrack: High-performance and Energy-efficient Event-Based Object Tracking with Spiking Neural Network
Stabilizing Streaming Video Geometry via Dynamic Feature Normalization
PaCo-RL: Advancing Reinforcement Learning for Consistent Image Generation with Pairwise Reward Modeling
Gradient Knows Best: Mixed-Precision Quantization via Gradient-Guided Bit Allocation for Super-Resolution
FantasyVLN: Unified Multimodal Chain-of-Thought Reasoning for Vision-and-Language Navigation
MM-ReCoder: Advancing Chart-to-Code Generation with Reinforcement Learning and Self-Correction
A Style is Worth One Code: Unlocking Code-to-Style Image Generation with Discrete Style Space
GThinker: Towards General Multimodal Reasoning via Cue-Guided Rethinking
PoseGaussian: 6D Pose Estimation for Unseen Objects via Sparse-View Object-Level 3D Gaussian Splatting
OraPO: Oracle-educated Reinforcement Learning for Data-efficient and Factual Radiology Report Generation
ELV-Halluc: Benchmarking Semantic Aggregation Hallucinations in Video Understanding
FSLoRA: Harmonizing Detection and Re-Identification via Freq-Spatial Low-Rank Adapter for One-Stage Person Search
GeoPredict: Leveraging Predictive Kinematics and 3D Gaussian Geometry for Precise VLA Manipulation
Concept-Aware LoRA for Domain-Aligned Segmentation Dataset Generation
SparseOIT: Improving Order-Independent Transparency 3DGS via Active Set Method
POINTS-Long: Adaptive Dual-Mode Visual Reasoning in MLLMs
Learning What Matters: Prioritized Concept Learning via Relative Error-driven Sample Selection
Hypergraph-State Collaborative Reasoning for Multi-Object Tracking
AG-VAS: Anchor-Guided Zero-Shot Visual Anomaly Segmentation with Large Multimodal Models
ComPose: A Unified Completion-Pose Framework for Robust Category-Level Object Pose Estimation
RealUnify: Do Unified Models Truly Benefit from Unification? A Comprehensive Benchmark
SVBench: Evaluation of Video Generation Models on Social Reasoning
AwareVLN: Reasoning with Self-awareness for Vision-Language Navigation
Node-RF: Learning Generalized Continuous Space-Time Scene Dynamics with Neural ODE-based NeRFs
Online Data Curation for Object Detection via Marginal Contributions to Dataset-level Average Precision
LaVR: Scene Latent Conditioned Generative Video Trajectory Re-Rendering using Large 4D Reconstruction Models
ODGS-SLAM: Omnidirectional Gaussian Splatting SLAM
EventHub: Data Factory for Generalizable Event-Based Stereo Networks without Active Sensors
Not All Birds Look The Same: Identity-Preserving Generation For Birds
Boosting Self-Supervised Tracking with Contextual Prompts and Noise Learning
EI-Part: Explode for Completion and Implode for Refinement
HierAmp: Coarse-to-Fine Autoregressive Amplification for Generative Dataset Distillation
InfiniDepth: Arbitrary-Resolution and Fine-Grained Depth Estimation with Neural Implicit Fields
Spatio-Temporal Difference Guided Motion Deblurring with the Complementary Vision Sensor
Defect Cue-Preserved Structural Feature Refinement for Few-Shot Anomaly Detection
RehearseVLA: Simulated Post-Training for VLAs with Physically-Consistent World Model
STAGE: Storyboard-Anchored Generation for Cinematic Multi-shot Narrative
Diffusion Mental Averages
Predicting Spatial Transcriptomics from Histology Images via High-Order Multi-Cell Interaction Modeling
When Safety Collides: Resolving Multi-Category Harmful Conflicts in Text-to-Image Diffusion via Adaptive Safety Guidance
Linking Modality Isolation in Heterogeneous Collaborative Perception
CoWTracker: Tracking by Warping instead of Correlation
Masked Region Transformer for Layered Image Generation and Editing at Scale
SPAR: Single-Pass Any-Resolution ViT for Open-vocabulary Segmentation
Layer Consistency Matters: Elegant Latent Transition Discrepancy for Generalizable Synthetic Image Detection
LDP-Slicing: Local Differential Privacy for Images via Randomized Bit-Plane Slicing
Motus: A Unified Latent Action World Model
Prompt-Free Universal Region Proposal Network
Stabilizing Feature Geometry in Noisy Pretrained Models for Robust Downstream Tasks
Fast Spatial Tracking with Visual Geometry Transformer
SketchDeco: Training-Free Latent Composition for Precise Sketch Colourisation
Echoes of Ownership: Adversarial-Guided Dual Injection for Copyright Protection in MLLMs
Physical Adversarial Clothing Evades Visible-Thermal Detectors via Non-Overlapping RGB-T Pattern
LazyVAR: Accelerating Visual Autoregressive Models via Scale-wise Token Pruning and Parallel Group Decoding
Rethinking Camera Choice: An Empirical Study on Fisheye Camera Properties in Robotic Manipulation
VideoSeek: Long-Horizon Video Agent with Tool-Guided Seeking
Beyond Prompt Degradation: Prototype-guided Dual-pool Prompting for Incremental Object Detection
ARC Is a Vision Problem!
WeDetect: Fast Open-Vocabulary Object Detection as Retrieval
Polyphony: Diffusion-based Dual-Hand Action Segmentation with Alternating Vision Transformer and Semantic Conditioning
Guardians of the Hair: Rescuing Soft Boundaries in Depth, Stereo, and Novel Views
VisiLock: Authorizing Instruction-based Image editing with Dual Score Distillation
Complementary Prototype Mapping for Efficient Multimodal Anomaly Detection
Wan-Weaver: Interleaved Multi-modal Generation via Decoupled Training
Content-Adaptive Hierarchical Hyperprior for Neural Video Coding
Fractal Camouflage: A Bio-Inspired Approach for Multi-Scale Adversarial Attacks in the Infrared Domain
Finding Distributed Object-Centric Properties in Self-Supervised Transformers
DyaDiT: A Multi-Modal Diffusion Transformer for Socially Favorable Dyadic Gesture Generation
SceMoS: Scene-Aware 3D Human Motion Synthesis by Planning with Geometry-Grounded Tokens
All Roads Lead to Rome: Incentivizing Divergent Thinking in Vision-Language Models
MCHDoc: A Comprehensive Benchmark for Reading Multi-Carrier Chinese Historical Documents
RefTon: Reference person shot assist virtual Try-on
PyraTok: Language-Aligned Pyramidal Tokenizer for Video Understanding and Generation
SymphoMotion: Joint Control of Camera Motion and Object Dynamics for Coherent Video Generation
Cross-Slice Knowledge Transfer via Masked Multi-Modal Heterogeneous Graph Contrastive Learning for Spatial Gene Expression Inference
M3DocDep: Multi-modal, Multi-page, Multi-document Dependency Chunking with Large Vision-Language Models
3M-TI: High-Quality Mobile Thermal Imaging via Calibration-free Multi-Camera Cross-Modal Diffusion
Love Me, Love My Label: Rethinking the Role of Labels in Prompt Retrieval for Visual In-Context Learning
Learning to Adapt: Self-Improving Web Agent via Cognitive-Aware Exploration
ORSATR-X: A Foundation Model based on Differential-and-Excitation Networks for Optical Remote Sensing Object Recognition
EarlyTom: Early Token Compression Completes Fast Video Understanding
MM-SeR: Multimodal Self-Refinement for Lightweight Image Captioning
Thermally Activated Dual-Modal Adversarial Clothing against AI Surveillance Systems
ReaGEN: Adaptive Generation of Structured Chains-of-Thought for Efficient Multimodal Reasoning
Revisiting Geometric Obfuscation with Dual Convergent Lines for Privacy-Preserving Image Queries in Visual Localization
G^2VLM: Geometry Grounded Vision Language Model with Unified 3D Reconstruction and Spatial Reasoning
High Resolution Neural Video Coding with Bi-directional Confidence-Guided Reference Information Modeling
WorldStereo: Bridging Controllable Video Generation and Scene Reconstruction via 3D Geometric Memories
Hunting Normality from Query Sample via Residual Learning for Generalist Anomaly Detection
Learning Coordinate-based Convolutional Kernels for Continuous SE(3) Equivariant and Efficient Point Cloud Analysis
MedKCO: Medical Vision-Language Pretraining via Knowledge-Driven Cognitive Orchestration
PRISM: Learning a Shared Primitive Space for Transferable Skeleton Action Representation
Diffusion-Based Makeup Transfer with Facial Region-Aware Makeup Features
SALMUBench: A Benchmark for Sensitive Association-Level Multimodal Unlearning
Spherical Leech Quantization for Visual Tokenization and Generation
MetaSpectra+: A Compact Broadband Metasurface Camera for Snapshot Hyperspectral+ Imaging
Simpleposter: A Simple Baseline For Product Poster Generation
Simple-ViLMedSAM: Simple Text Prompts Meet Vision-Language Models for Medical Image Segmentation
TIPSv2: Advancing Vision-Language Pretraining with Enhanced Patch-Text Alignment
SceneMaker: Open-set 3D Scene Generation with Decoupled De-occlusion and Pose Estimation Model
EmoThinker: Advancing Visual-Acoustic Emotion Analysis via Structural Token Selection and Chain-of-Thought Reasoning
Temporal Equilibrium MeanFlow: Bridging the Scale Gap for One-Step Generation
Remedying Target-Domain Astigmatism for Cross-Domain Few-Shot Object Detection
Uncertainty-guided Compositional Alignment with Part-to-Whole Semantic Representativeness in Hyperbolic Vision-Language Models
The LLM Bottleneck: Why Open-Source Vision LLMs Struggle with Hierarchical Visual Recognition
Self-Critical Distillation Network for Video-based Commonsense Captioning
MuViT: Multi-Resolution Vision Transformers for Learning Across Scales in Microscopy
Parse, Search, and Confirmation: Training-Free Aerial Vision-and-Dialog Navigation with Chain-of-Thought Reasoning and Structured Spatial Memory
Block-Sparse Global Attention for Efficient Multi-View Geometry Transformers
WiTTA-Bench: Benchmarking Test-Time Adaptation for WiFi Sensing
TANGO: Learning Distribution-wise Foundation Prior Consistency and Instance-wise Style Calibration for Medical Image Generalization
SGSoft: Learning Fused Semantic-Geometric Features for 3D Shape Correspondence via Template-Guided Soft Signals
FluoCLIP: Stain-Aware Focus Quality Assessment in Fluorescence Microscopy
A Provable Energy-Guided Test-Time Defense Boosting Adversarial Robustness of Large Vision-Language Models
Dense Metric Depth Completion from Sparse Direct Time-of-Flight Sensors
OMG-Bench: A New Challenging Benchmark for Skeleton-based Online Micro Hand Gesture Recognition
Prune2Drive: A Plug-and-Play Framework for Accelerating Vision-Language Models in Autonomous Driving
LitePT: Lighter Yet Stronger Point Transformer
Dynamic-eDiTor: Training-Free Text-Driven 4D Scene Editing with Multimodal Diffusion Transformer
TF-CADE: Foreground-Concentrated Text-Video Alignment for Zero-Shot Temporal Action Detection
Guiding Diffusion-based Reconstruction with Contrastive Signals for Balanced Visual Representation
Emergent Outlier View Rejection in Visual Geometry Grounded Transformers
AutoCut: End-to-end advertisement video editing based on multimodal discretization and controllable generation
CAPT: Confusion-Aware Prompt Tuning for Reducing Vision-Language Misalignment
Dynamic Logits Adjustment and Exploration for Test-Time Adaptation in Vision Language Models
WildRayZer: Self-supervised Large View Synthesis in Dynamic Environments
WorldGen: From Text to Traversable and Interactive 3D Worlds
Spectrally Distilled Representations Aligned with Instruction-Augmented LLMs for Satellite Imagery
HAMMER: Harnessing MLLMs via Cross-Modal Integration for Intention-Driven 3D Affordance Grounding
Masked-Diffusion Autoencoders for 3D Medical Vision Representation Learning
IF-Prune: Information-Flow Guided Token Pruning for Efficient Vision-Language Models
Curvature-Aware Zeroth-Order Optimization for Memory-Efficient Test-Time Adaptation
AHS: Adaptive Head Synthesis via Synthetic Data Augmentations
PanoVGGT: Feed-Forward 3D Reconstruction from Panoramic Imagery
RealAppliance: Let High-fidelity Appliance Assets Controllable and Workable as Aligned Real Manuals
Mind the Hitch: Dynamic Calibration and Articulated Perception for Autonomous Trucks
SJD-PAC: Accelerating Speculative Jacobi Decoding via Proactive Drafting and Adaptive Continuation
EvoComp: Learning Visual Token Compression for Multimodal Large Language Models via Semantic-Guided Evolutionary Labeling
Few-Shot Hybrid Incremental Learning: Continually Learning under Data Scarcity and Task Uncertainty
SLVMEval: Synthetic Meta Evaluation Benchmark for Text-to-Long Video Generation
VEMamba: Efficient Isotropic Reconstruction of Volume Electron Microscopy with Axial-Lateral Consistent Mamba
HierEdit: Region-Aware Hierarchical Diffusion for Efficient High-Resolution Editing
FINE: Factorizing Knowledge for Initialization of Variable-sized Diffusion Models
Decouple Your Discovery and Memory in Continual Generalized Category Discovery
HumanNOVA: Photorealistic, Universal and Rapid 3D Human Avatar Modeling from a Single Image
Learning Scene Coordinate Reconstruction from Unposed Images via Pose Graph Optimization
ULF-Loc: Unbiased Landmark Feature for Robust Visual Localization with 3D Gaussian Splatting
Sky2Ground: A Benchmark for Site Modeling under Varying Altitude
FORCE: Transferable Visual Jailbreaking Attacks via Feature Over-Reliance CorrEction
Minerva-Ego: Spatiotemporal Hints for Egocentric Video Understanding
Cross-domain Dual-stream Feature Disentanglement for Brain Disorder Prediction with Sparsely Labeled PET
Memory-Augmented Scene Understanding and Exploration for Open-World Aerial Object-Goal Navigation
Registration-Free Learnable Multi-View Capture of Faces in Dense Semantic Correspondence
Decoupling Vision and Language: Codebook Anchored Visual Adaptation
DSFlash: Comprehensive Panoptic Scene Graph Generation in Realtime
Forging a Dynamic Memory: Retrieval-Guided Continual Learning for Generalist Medical Foundation Models
OmniZip: Learning a Unified and Lightweight Lossless Compressor for Multi-Modal Data
AdapTok: Learning Adaptive and Temporally Causal Video Tokenization in a 1D Latent Space
Revisiting Visual Corruptions in LVLMs: A Shape–Texture Perspective on Model Failures
Seeing Conversations: Communication Context Identification in Egocentric Video
Omni-Attribute: Open-vocabulary Attribute Encoder for Visual Concept Personalization
SAIDO: Generalizable Detection of AI-Generated Images via Scene-Aware and Importance-Guided Dynamic Optimization in Continual Learning
FLOW: Feature-Level Optimal Warping for Generalized Remote Physiological Measurement
LangField4D: Learning Identity-Adaptive and Spatio-Temporal Continuous 4D Language Fields for Dynamic Scenes
Occluded Human Body Capture with Frequency Domain Denoising Prior
ChartR: Evaluating Reasoning Accuracy and Robustness in Chart Question Answering
LoG3D: Ultra-High-Resolution 3D Shape Modeling via Local-to-Global Partitioning
CycleManip: Enabling Cycle-based Manipulation via Effective History Perception and Understanding
Elucidating the Design Space of Arbitrary-Noise-Based Diffusion Models
Beyond Text: Visual Description Assembly by Probabilistic Model for CLIP-based Weakly Supervised Semantic Segmentation
Decoupled and Reusable Adaptation for Efficient Cross-Modal Transfer
Olbedo: An Albedo and Shading Aerial Dataset for Large-Scale Outdoor Environments
IrisFP: Adversarial-Example-based Model Fingerprinting with Enhanced Uniqueness and Robustness
ELVIS: Enhance Low-Light for Video Instance Segmentation in the Dark
SkySense-VITA: Towards Universal In-context Segmentation of Multi-modal Remote Sensing Imagery
FluxMem: Adaptive Hierarchical Memory for Streaming Video Understanding
SIGMA: Selective-Interleaved Generation with Multi-Attribute Tokens
PEARL: Geometry Aligns Semantics for Training-Free Open-Vocabulary Semantic Segmentation
MMLandmarks: a Cross-View Instance-Level Benchmark for Geo-Spatial Understanding
SegGBC: Justifiable Coarse-to-Fine Granular-Ball Computing for Enhancing Clustering Image Segmentation
Hearing the Room Through the Shape of the Drum: Modal-Guided Sound Recovery from Multi-Point Surface Vibrations
RiskProp: Collision-Anchored Self-Supervised Risk Propagation For Early Accident Anticipation
Learning to Reason in 4D: Dynamic Spatial Understanding for Vision Language Models
TGSFormer: Scalable Temporal Gaussian Splatting for Embodied Semantic Scene Completion
FEAST: Fully Connected Expressive Attention for Spatial Transcriptomics
SonoWorld: From One Image to a 3D Audio-Visual Scene
AREA3D: Active Reconstruction Agent with Unified Feed-Forward 3D Perception and Vision-Language Guidance
CRFT: Consistent–Recurrent Feature Flow Transformer for Cross-Modal Image Registration
MMDIR: Multimodal Instruction-Driven Framework for Mixed-Degradation Document Image Restoration
Protect to Adapt: Subspace-Constrained Adaptation with Ranked Negative Prompt Feedback for Few-Shot Action Recognition
Towards High-resolution and Disentangled Reference-based Sketch Colorization
Detect Any AI-Counterfeited Text Image
Unifying Language-Action Understanding and Generation for Autonomous Driving
Dual Band Thermal Videography: Separating Time-Varying Reflection and Emission Near Ambient Conditions
LaSM: Layer-wise Scaling Mechanism for Defending Pop-up Attack on GUI Agents
Breaking the Continuum: Discrete Distribution Learning for Structural MRI Reconstruction
MatPedia: A Universal Generative Foundation for High-Fidelity Material Synthesis
MANSION: Multi-floor lANguage-to-3D Scene generatIOn for loNg-horizon tasks
See What I Mean: Aligning Vision and Language Representations for Video Fine-grained Object Understanding
H2-Surv: Hierarchical Hyperbolic Multimodal Representation Learning for Survival Prediction
DRAMA: Next-Gen Dynamic Orchestration for Resilient Multi-Agent Ecosystems in Flux
Local Motion Matters: A Deconstruct–Recompose Paradigm for Reinforcement Learning Pre-training from Videos
Haptic Neural Fields: Bringing Tactile Interactions to 3D Rendered Scenes
From Weights to Concepts: Data-Free Interpretability of CLIP via Singular Vector Decomposition
Selectively Extracting and Injecting Visual Attributes into Text-to-Image Models
StableMTL: Repurposing Latent Diffusion Models for Multi-Task Learning from Partially Annotated Synthetic Datasets
PointCSP: Cross-Sample Semantic Propagation and Stability Preservation in Self-Supervised Point Cloud Learning
Scone: Bridging Composition and Distinction in Subject-Driven Image Generation via Unified Understanding-Generation Modeling
FedSST: Rethinking Fair Federated Graph Learning under Structural Shift
Locate-then-Sparsify: Attribution Guided Sparse Strategy for Visual Hallucination Mitigation
FedARA: Resource-adaptive Low-rank Personalized Federated Learning via Anchor-driven Representation Alignment on Heterogeneous Edge Devices
Face-Guided Sentiment Boundary Enhancement for Weakly-Supervised Temporal Sentiment Localization
SIMSPINE: A Biomechanics-Aware Simulation Framework for 3D Spine Motion Annotation and Benchmarking
Omni2Sound: Towards Unified Video-Text-to-Audio Generation
The Blind Spot of Adaptation: Quantifying and Mitigating Forgetting in Fine-tuned Driving Models
RECS4R: Bridging Semantics and Geometry for Referring Remote Sensing Interpretation
GOR-IS: 3D Gaussian Object Removal In the Intrinsic Space
Voxify3D: Pixel Art Meets Volumetric Rendering
EW-DETR: Evolving World Object Detection via Incremental Low-Rank DEtection TRansformer
IMU-HOI: A Symbiotic Framework for Coherent Human-Object Interaction and Motion Capture via Contact-Conscious Inertial Fusion
From Sketch to Fresco: Efficient Diffusion Transformer with Progressive Resolution
Layer-wise Instance Binding for Regional and Occlusion Control in Text-to-Image Diffusion Transformers
Towards Highly Transferable Vision-Language Attack via Semantic-Augmented Dynamic Contrastive Interaction
Target-Aware Invertible Encoder with Reconstruction Guidance for Infrared Small Target Detection
ModularAgent: A Task-Aware Modular Framework for Joint Optimization of Multimodal Large Language Models and World Models
Bridging Domains through Subspace-Aware Model Merging
Cross-View Distillation and Adaptive Masking for Incomplete Multi-View Multi-Label Classification
Selection-as-Nonlinearity: Bridging Attention and Activation via a Joint Game–Decision Lens for Interpretable, Discriminative Visual Representations
Spatial-Aware VLA Pretraining through Visual-Physical Alignment from Human Videos
Rotation Invariant and Symmetry Aware Pixel Difference Network for Remote Sensing Object Detection
Language-Guided One-Step Diffusion Model for Nighttime Flare Removal
PnP-CM: Consistency Models as Plug-and-Play Priors for Inverse Problems
GeoDiff4D: Geometry-Aware Diffusion for 4D Head Avatar Reconstruction
Remote Sensing Image Super-Resolution for Imbalanced Textures: A Texture-Aware Diffusion Framework
Hist2Style: Histogram-Guided Stylization with Bilateral Grids
What Matters in Practical Learned Image Compression
LLaVAShield: Safeguarding Multimodal Multi-Turn Dialogues in Vision-Language Models
OctoT2I: A Self-Evolving Agentic Text-to-Image Router
R2G: A Multi-View Circuit Graph Benchmark Suite from RTL to GDSII
Evidential Transformation Network: Turning Pretrained Models into Evidential Models for Post-hoc Uncertainty Estimation
SplitFlux: Learning to Decouple Content and Style from a Single Image
SemanticVLA: Towards Semantic Reasoning over Action Memorization via Synergistic Explicit Trace and Latent Action Planning
SubspaceAD: Training-Free Few-Shot Anomaly Detection via Subspace Modeling
L3DR: 3D-aware LiDAR Diffusion and Rectification
TAlignDiff: Automatic Tooth Alignment assisted by Diffusion-based Transformation Learning
MV-Fashion: Towards Enabling Virtual Try-On and Size Estimation with Multi-View Paired Data
Generalizable Sparse-View 3D Reconstruction from Unconstrained Images
Do You See What I Am Pointing At? Gesture-Based Egocentric Video Question Answering
Disentangling to Re-couple: Resolving the Similarity-Controllability Paradox in Subject-Driven Text-to-Image Generation
Generalizing Visual Geometry Priors to Sparse Gaussian Occupancy Prediction
Iterative Closed-Loop Motion Synthesis for Scaling the Capabilities of Humanoid Control
When Lines Meet Textures: Spatial-Frequency Aligned Diffusion Features for Cross-Sparsity Correspondence
SearchAD: Large-Scale Rare Image Retrieval Dataset for Autonomous Driving
CDICS: Delving Into Fine-Grained Attribute for In-Context Segmentation via Compositional Prompts and Phased Decoupling
DreamOmni2: Multimodal Instruction-based Generation and Editing
ElasticFormer: Detecting Objects in HRW Shots via Elastic Computing Vision Transformer
TACO: Task-Aware Contrastive Learning for Joint LiDAR Localization and 3D Object Detection
JANUS: A Lightweight Framework for Jailbreaking Text-to-Image Models via Distribution Optimization
Pushing the Frontier of Audiovisual Perception with Large-Scale Multimodal Correspondence Learning
WiseEdit: Benchmarking Cognition- and Creativity-Informed Image Editing
Detecting Unknown Objects via Energy-based Separation for Open World Object Detection
MedTVT-R1: A Multimodal LLM Empowering Medical Reasoning and Diagnosis
UniAVGen: Unified Audio and Video Generation with Asymmetric Cross-Modal Interactions
PHANTOM: Physics-Infused Video Generation via Joint Modeling of Visual and Latent Physical Dynamics
Similarity-Consistent Likelihood Diffusion enables Hidden Person Detection from Wall Reflections
UCAN: Unified Convolutional Attention Network for Expansive Receptive Fields in Lightweight Super-Resolution
When Anonymity Breaks: Identifying Models Behind Text-to-Image Leaderboards
ProOOD: Prototype-Guided Out-of-Distribution 3D Occupancy Prediction
VesMamba: 3D Pulmonary Vessel Segmentation from CT images via Mamba with Structural Perception and Scale-aware Filtering
Preserving Source Video Realism: High-Fidelity Face Swapping for Cinematic Quality
4D Primitive-Mâché: Glueing Primitives for Persistent 4D Scene Reconstruction
FSFSplatter: Geometrically Accurate Reconstruction with Free Sparse-view Images within 2 minutes
Edit-aware RAW reconstruction
HySeg: Learning Generative Priors for Structure-Aware Remote Sensing Segmentation
View-Aware Semantic Alignment for Aerial-Ground Person Re-Identification
Attribution-Guided Model Rectification of Unreliable Neural Network Behaviors
PSR: Scaling Multi-Subject Personalized Image Generation with Pairwise Subject-Consistency Rewards
Enhancing Continual Learning of Vision-Language Models via Dynamic Prefix Weighting
STARFlow-V: End-to-End Video Generative Modeling with Autoregressive Normalizing Flows
DeCo: Frequency-Decoupled Pixel Diffusion for End-to-End Image Generation
A Geometric Algebra-Informed 3DGS Framework for Wireless Channel Prediction
RealVLG-R1: A Large-Scale Real-World Visual-Language Grounding Benchmark for Robotic Perception and Manipulation
FlashLips: 100-FPS Mask-Free Latent Lip-Sync using Reconstruction Instead of Diffusion or GANs
PS-SR: Pseudo-Single-Step Video Super-Resolution via Speculative Diffusion
LA-Pose: Latent Action Pretraining Meets Pose Estimation
SelfHVD: Self-Supervised Handheld Video Deblurring
OmniVGGT: Omni-Modality Driven Visual Geometry Grounded Transformer
Controllable Federated Prompt Learning at Test Time
EffectMaker: Unifying Reasoning and Generation for Customized Visual Effect Creation
Bayesian Decomposition and Semantic Completion for Few-shot Semantic Segmentation
What’s Wrong with Synthetic Data for Scene Text Recognition? A Strong Synthetic Engine with Diverse Simulations and Self-Evolution
Motion-Aware Animatable Gaussian Avatars Deblurring
VVS: Accelerating Speculative Decoding for Visual Autoregressive Generation via Partial Verification Skipping
Learning to Identify Out-of-Distribution Objects for 3D LiDAR Anomaly Segmentation
ChimeraLoRA: Multi-Head LoRA-Guided Synthetic Datasets
Natural Human Motion Recovery by Aligning High-Order Temporal Dynamics from Monocular Videos
LoST: Level of Semantics Tokenization for 3D Shapes
ProxyFL: A Proxy-Guided Framework for Federated Semi-Supervised Learning
MoVie: Broaden Your Views with Human Motion for Action Detection
MagicQuill V2: Precise and Interactive Image Editing with Layered Visual Cues
Score2Instruct: Scaling Up Video Quality-Centric Instructions via Automated Dimension Scoring
CodeDance: A Dynamic Tool-integrated MLLM for Executable Visual Reasoning
GA-VLN: Geometry-Aware BEV Representation for Efficient Vision-Language Navigation
Cluster-aware Anchor Learning for Multi-View Clustering
Pano360: Perspective to Panoramic Vision with Geometric Consistency
iLRM: An Iterative Large 3D Reconstruction Model
HoloCine: Holistic Generation of Cinematic Multi-Shot Long Video Narratives
Distilling Balanced Knowledge from a Biased Teacher
Scaling Instruction-Based Video Editing with a High-Quality Synthetic Dataset
Learning Explicit Continuous Motion Representation for Dynamic Gaussian Splatting from Monocular Videos
Random Wins All: Rethinking Grouping Strategies for Vision Tokens
Towards an Incremental Unified Multimodal Anomaly Detection: Augmenting Multimodal Denoising From an Information Bottleneck Perspective
Seeing through boxes: Non-Line-of-Sight 3D Reconstruction from Radar Signals
NeuroFlow: Toward Unified Visual Encoding and Decoding from Neural Activity
GaussianGrow: Geometry-aware Gaussian Growing from 3D Point Clouds with Text Guidance
PDCR: Perception-Decomposed Confidence Reward for Vision-Language Reasoning
CoT-Edit: Let CoT Guide Instruction Video Editing
Robo-SGG: Exploiting Layout-Oriented Normalization and Restitution Can Improve Robust Scene Graph Generation
SHOW3D: Capturing Scenes of 3D Hands and Objects in the Wild
MicroFM: Physics-guided Flow Matching for Isotropic Microscopy Reconstruction
Probabilistic Concept Graph Reasoning for Multimodal Misinformation Detection
HyperGaussians: High-Dimensional Gaussian Splatting for High-Fidelity Animatable Face Avatars
SinGeo: Unlock Single Model's Potential for Robust Cross-View Geo-Localization
Spot The Ball: A Benchmark for Visual Social Inference
CADFS: A Big CAD Program Dataset and Framework for Computer-Aided Design with Large Language Models
Bridging the Modality Gap in Compositional Zero-Shot Learning via Sparse Alignment and Unimodal Memory Bank
ShotDirector: Directorially Controllable Multi-Shot Video Generation with Cinematographic Transitions
CARI4D: Category Agnostic 4D Reconstruction of Human-Object Interaction
FocusUI: Efficient UI Grounding via Position-Preserving Visual Token Selection
TokenGS: Decoupling 3D Gaussian Prediction from Pixels with Learnable Tokens
4DWorldBench: A Comprehensive Evaluation Framework for 3D/4D World Generation Models
AXG-Reasoner: Error Detection and Explanation in Long Task Videos with Vision–Language Models
MV-RoMa: From Pairwise Matching into Multi-View Track Reconstruction
AnyDoc: Enhancing Document Generation via Large-Scale HTML/CSS Data Synthesis and Height-Aware Reinforcement Optimization
Audio-sync Video Instance Editing with Granularity-Aware Mask Refiner
Breaking the 3D Dataset Bottleneck: Fast Scalable Generation of Aligned 3D Assets from Scratch for Category 6D Pose Estimation and Robotic Grasping
ViterbiPlanNet: Injecting Procedural Knowledge via Differentiable Viterbi for Planning in Instructional Videos
Learning by Analogy: A Causal Framework for Compositional Generalization
Omni-AD: A Large-scale and Versatile Benchmark for Industrial Anomaly Detection
SCE-Depth: A Spherical Compound Eye Framework for Wide FOV Depth Estimation
GaussianDWM: 3D Gaussian Driving World Model for Unified Scene Understanding and Multi-Modal Generation
VRR-QA: Visual Relational Reasoning in Videos Beyond Explicit Cues
CHAL: Causal-guided Hierarchical Anomaly-aware Learning for Moving Infrared Small Target Detection
DeepProtect: Proactive Face-Swapping Defense using Identity Blending and Attribute Distortion
Deformable Gaussian Occupancy: Decoupling Rigid and Nonrigid Motion with Factorized Distillation
Scaling Spatial and Temporal Context for Robotic Imitation Learning Policies With Scene Graphs
Exposing Functional Fusion: A New Class of Strategic Backdoor in Dynamic Prompt Architectures
Modeling the Brain’s Grammar: ROI-Guided fMRI Pretraining for Transferable and Interpretable Vision Decoding
Tracking by Predicting 3-D Gaussians Over Time
HandDreamer: Zero-Shot Text to 3D Hand Model Generation using Corrective Hand Shape Guidance
OSPO: Object-Centric Self-Improving Preference Optimization for Text-to-Image Generation
UFVideo: Towards Unified Fine-Grained Video Cooperative Understanding with Large Language Models
Diffusion Guided Chain-of-Vision for Large Autoregressive Vision Models
WeatherCity: Urban Scene Reconstruction with Controllable Multi-Weather Transformation
Protego: User-Centric Pose-Invariant Privacy Protection Against Face Recognition-Induced Digital Footprint Exposure
LLMind: Bio-inspired Training-free Adaptive Visual Representations for Vision-Language Models
Steering Where to Diffuse: Generative Modeling of Phenotypic Response Simulation with Steered Diffusion Bridge
3D-Fixer: Coarse-to-Fine In-place Completion for 3D Scenes from a Single Image
EduDiag: A Benchmark for Educational Diagnostic Reasoning with Error Tracing and Correction on Large Multimodal Models
CrowdGaussian: Reconstructing High-Fidelity 3D Gaussians for Human Crowd from a Single Image
AtomicVLA: Unlocking the Potential of Atomic Skill Learning in Robots
Phased DMD: Few-step Distribution Matching Distillation via Score Matching within Subintervals
SOUPLE: Enhancing Audio-Visual Localization and Segmentation with Learnable Prompt Contexts
Heuristic-inspired Reasoning Priors Facilitate Data-Efficient Referring Object Detection
GaussianFluent: Gaussian Simulation for Dynamic Scenes with Mixed Materials
RINO: Rotation-Invariant Non-Rigid Correspondences
Plant Taxonomy Meets Plant Counting: A Fine-Grained, Taxonomic Dataset for Counting Hundreds of Plant Species
Zoo3D: Zero-Shot 3D Object Detection at Scene Level
FilterGS: Traversal-Free Parallel Filtering and Adaptive Shrinking for Large-Scale LoD 3D Gaussian Splatting
Reparameterized Tensor Ring Functional Decomposition for Multi-Dimensional Data Recovery
Structural Graph Probing of Vision–Language Models
LEADER: Learning Reliable Local-to-Global Correspondences for LiDAR Relocalization
Robustness Under Data Scarcity: Few-Shot Continual Adversarial Training for Evolving Threats
Thinking Beyond Labels: Vocabulary-Free Fine-Grained Recognition using Reasoning-Augmented LMMs
DDiT: Dynamic Patch Scheduling for Efficient Diffusion Transformers
HiDRA: Hierarchical Degradation Representation and Adaptation with Generative Priors for Enhancing Infrared Vision
Omni-MMSI: Toward Identity-attributed Social Interaction Understanding
SIMPACT: Simulation-Enabled Action Planning using Vision-Language Models
Fed-ADE: Adaptive Learning Rate for Federated Post-adaptation under Distribution Shift
Multi-Scale Gaussian-Language Map for Zero-shot Embodied Navigation and Reasoning
Spatiotemporal Pyramid Flow Matching for Climate Emulation
Stable Spike: Dual Consistency Optimization via Bitwise AND Operations for Spiking Neural Networks
H-Sets: Hessian-Guided Discovery of Set-Level Feature Interactions in Image Classifiers
Token Reduction via Local and Global Contexts Optimization for Efficient Video Large Language Models
Orthogonal Spatial-Aware Multi-View Anchor Graph Clustering for Incomplete Remote Sensing Data
PRISM: Video Dataset Condensation with Progressive Refinement and Insertion for Sparse Motion
DMGD: Train-Free Dataset Distillation with Semantic-Distribution Matching in Diffusion Models
FlowHijack: A Dynamics-Aware Backdoor Attack on Flow-Matching Vision-Language-Action Models
TrajRAG: Retrieving Geometric-Semantic Experience for Zero-Shot Object Navigation
DRM: Diffusion-based Reward Model With Step-wise Guidance
TruckDrive: Long-Range Autonomous Highway Driving Dataset
MVP: Multiple View Prediction Improves GUI Grounding
TESO: Online Tracking of Essential Matrix by Stochastic Optimization
DuoGen: Towards Autonomous Interleaved Multimodal Generation
AGENTSAFE: Benchmarking the Safety of Embodied Agents on Hazardous Instructions
Diff4Splat: Repurposing Video Diffusion Models for Dynamic Scene Generation
Unleashing VLA Potentials in Autonomous Driving via Explicit Learning from Failures
OPRO: Orthogonal Panel-Relative Operators for Panel-Aware In-Context Image Generation
Improving Controllable Generation: Faster Training and Better Performance via x0-Supervision
ClimaOoD: Improving Anomaly Segmentation via Physically Realistic Synthetic Data
HieraMamba: Video Temporal Grounding via Hierarchical Anchor-Mamba Pooling
DiverseDiT: Towards Diverse Representation Learning in Diffusion Transformers
EgoSound: Benchmarking Sound Understanding in Egocentric Videos
SCoRe: Salience-Coverage Reduction for Vision Token Pruning in Vision-Language Models
SemLayer: Semantic-aware Generative Segmentation and Layer Construction for Abstract Icons
ActionMesh: Animated 3D Mesh Generation with Temporal 3D Diffusion
Universal-to-Specific: Dynamic Knowledge-Guided Multiple Instance Learning for Few-Shot Whole Slide Image Classification
Boosting Vision-Language-Action Finetuning with Feasible Action Neighborhood Prior
Ani3DHuman: Photorealistic 3D Human Animation with Self-guided Stochastic Sampling
Plug-and-Play PDE Optimization for 3D Gaussian Splatting: Toward High-Quality Rendering and Reconstruction
A Debiased Reconstruction-based Framework for Training-Free Detection of AI-Generated Images
CodeV: Code with Images for Faithful Visual Reasoning via Tool-Aware Policy Optimization
Single-step Diffusion-based Video Coding with Semantic-Temporal Guidance
Learning to Control Physically-simulated 3D Characters via Generating and Mimicking 2D Motions
DuoMo: Dual Motion Diffusion for World-Space Human Reconstruction
Inference-time Physics Alignment of Video Generative Models with Latent World Models
HiFi-BRep: High-Fidelity Latent Representation for Robust B-Rep Generation
Deciphering Genotype-Phenotype Mechanisms from High-Content Profiling via Knowledge-Guided Multi-modal Graph Learning
ManifoldGD: Training-Free Hierarchical Manifold Guidance for Diffusion-Based Dataset Distillation
TAP: A Token-Adaptive Predictor Framework for Training-Free Diffusion Acceleration
MambaSIC: Mamba-based Stereo Image Compression with Bi-directional Multi-reference Entropy Model
GeodesicNVS: Probability Density Geodesic Flow Matching for Novel View Synthesis
Text-Printed Image: Bridging the Image-Text Modality Gap for Text-centric Training of Large Vision-Language Models
Visual-Aware CoT: Achieving High-Fidelity Visual Consistency in Unified Models
RobotSeg: A Model and Dataset for Segmenting Robots in Image and Video
ZeroIDIR: Zero-Reference Illumination Degradation Image Restoration with Perturbed Consistency Diffusion Models
3D-LATTE: Latent Space 3D Editing from Textual Instructions
Active Intelligence in Video Avatars via Closed-loop World Modeling
AToken: A Unified Tokenizer for Vision
ClusterMark: Towards Robust Watermarking for Autoregressive Image Generators with Visual Token Clustering
CaT-GS: Efficient 3DGS Rendering for Large Scale Scenes via Inter-frame Caching and Tile Scheduling
Mask to Align, Weight to Disambiguate: Reliable Unsupervised Cross-Modal Hashing with Masked-Weight Contrast
Is Parameter Isolation Better for Prompt-Based Continual Learning?
Affordance Field Intervention: Enabling VLAs to Escape Memory Traps in Robotic Manipulation
TV2TV: A Unified Framework for Interleaved Language and Video Generation
CMR-RD: Long-Tailed Adaptive VLM for Explainable CMR Diagnosis
CAR-SAM: Cross-Attention Reconstruction for Post-Training Quantization of the Segment Anything Model
ThinkingViT: Matryoshka Thinking Vision Transformer for Elastic Inference
GenSplat: Bridging the Generalization Gap in 3DGS Language Comprehension
Enhancing Mixture-of-Experts Specialization via Cluster-Aware Upcycling
Muses: Designing, Composing, Generating Nonexistent Fantasy 3D Creatures without Training
AnthroTAP: Learning Point Tracking with Real-World Motion
VINS-120K: Ultra High-Resolution Image Editing with A Large-Scale Dataset
Learning complete and explainable visual representations from itemized text supervision
Rounded or Streamlined Head? Bridging Concept Bottleneck Models and Attribute-Described Object Parts
Spectrum from Defocus: Fast Spectral Imaging with Chromatic Focal Stack
One Algorithm to Align Them All
What Do Visual Tokens Really Encode? Uncovering Sparsity and Redundancy in Multimodal Large Language Models
Stake the Points: Structure-Faithful Instance Unlearning
Learning to See through Illumination Extremes with Event Streaming in Multimodal Large Language Models
Thermal is Always Wild: Characterizing and Addressing Challenges in Thermal-Only Novel View Synthesis
CountGD++: Generalized Prompting for Open-World Counting
RMIR: A Benchmark Dataset for Reasoning-Intensive Multimodal Image Retrieval
ViKey: Enhancing Temporal Understanding in Videos via Visual Prompting
The Universal Normal Embedding
Reflection Separation from a Single Image via Joint Latent Diffusion
Towards Robust Sequential Decomposition for Complex Image Editing
Coupled Diffusion Sampling for Training-Free Multi-View Image Editing
CoordSpeaker: Exploiting Gesture Captioning for Coordinated Caption-Empowered Co-Speech Gesture Generation
HiSpatial: Taming Hierarchical 3D Spatial Understanding in Vision-Language Models
SEA: Evaluating Sketch Abstraction Efficiency via Element-level Commonsense Visual Question Answering
Emergent Extreme-View Geometry in 3D Foundation Models
Stay in your Lane: Role Specific Queries with Overlap Suppression Loss for Dense Video Captioning
MeshSplatting: Differentiable Rendering with Opaque Meshes
Are We Ready for RL in Text-to-3D Generation? A Progressive Investigation
AudioAvatar: Personalized Audio-driven Whole-body Talking Avatars
CARE: A Molecular-Guided Foundation Model with Adaptive Region Modeling for Whole Slide Image Analysis
EMMA: Extracting Multiple physical parameters from Multimodal Data
VoDaSuRe: A Large-Scale Dataset Revealing Domain Shift in Volumetric Super-Resolution
Physically Inspired Gaussian Splatting for HDR Novel View Synthesis
mmWaveFlow: Unified Enhancement and Generation of mmWave Human Point Clouds
From Attraction to Equilibrium: Physics-Inspired Semantic Gravitons for Zero-Shot Anomaly Detection
Match-and-Fuse: Consistent Generation from Unstructured Image Sets
VisionDirector: Vision-Language Guided Closed-Loop Refinement for Generative Image Synthesis
CUBic: Coordinated Unified Bimanual Perception and Control Framework
Continual Learning for fMRI-Based Brain Disorder Diagnosis via Functional Connectivity Matrices Generative Replay
Flow Matching for Multimodal Distributions
PromptDepth: Efficient and Promptable Geometric 3D Vision Model for Embodied Intelligence
DualReg: Dual-Space Filtering and Reinforcement for Rigid Registration
ParticleGS: Learning Neural Gaussian Particle Dynamics from Videos for Prior-free Physical Motion Extrapolation
FILTR: Extracting Topological Features from Pretrained 3D Models
CG-Reasoner: Centroid-Guided Positional Reasoning Segmentation for Medical Imaging with a Robust Visual-Text Consistency Metric
CRIT: Graph-Based Automatic Data Synthesis to Enhance Cross-Modal Multi-Hop Reasoning
Thinking-while-Generating: Interleaving Textual Reasoning throughout Visual Generation
ScenDi: 3D-to-2D Scene Diffusion Cascades for Urban Generation
Garments2Look: A Multi-Reference Dataset for High-Fidelity Outfit-Level Virtual Try-On with Clothing and Accessories
Why Does RL Generalize Better Than SFT? A Data-Centric Perspective on VLM Post-Training
Lumosaic: Hyperspectral Video via Active Illumination and Coded-Exposure Pixels
Imagine Before Concentration: Diffusion-Guided Registers Enhance Partially Relevant Video Retrieval
SpikeTrack: A Spike-driven Framework for Efficient Visual Tracking
Beyond Endpoints: Path-Centric Reasoning for Vectorized Off-Road Network Extraction
RHINO: Reconstructing Human Interactions with Novel Objects from Monocular Videos
AntiStyler: Defending Object Detection Models Against Adversarial Patch Attacks Using Style Removal
CausalLens: Sensitivity-Guided Multi-Head Causal Intervention for Hallucination Mitigation in Large Vision-Language Models
Multi-Crit: Benchmarking Multimodal Judges on Pluralistic Criteria-Following
PETAR: Localized Findings Generation with Mask-Aware Vision-Language Modeling for PET Automated Reporting
Push-and-Step: From RL-Based Balance Recovery to Physical Simulation of Dense Crowds
Generalized and Personalized Federated Learning with Black-Box Foundation Models via Orthogonal Transformations
Spatial-SSRL: Enhancing Spatial Understanding via Self-Supervised Reinforcement Learning
LuxRemix: Lighting Decomposition and Remixing for Indoor Scenes
Teaching DINOv3 About Partial 3D Geometry: A Self-Supervised Geometry-Aware Approach
LensWalk: Agentic Video Understanding by Planning How You See in Videos
MixFlow Training: Alleviating Exposure Bias with Slowed Interpolation Mixture
Factorize, Reconstruct, Enhance: A Unified Framework for Multimodal Sentiment Analysis
Space-Time Forecasting of Dynamic Scenes with Motion-aware Gaussian Grouping
WAM-Flow: Parallel Coarse-to-Fine Motion Planning via Discrete Flow Matching for Autonomous Driving
FedAlign: Differentially Private Distribution Alignment for Non-IID Federated Learning
PPM-CLIP: Probabilistic Prompt Modeling for Generalizable AI-Generated Image Detection
Roots Beneath the Cut: Uncovering the Risk of Concept Recovery in Pruning-Based Unlearning for Diffusion Models
Off The Grid: Detection of Primitives for Feed-Forward 3D Gaussian Splatting
BD-Merging: Bias-Aware Dynamic Model Merging with Evidence-Guided Contrastive Learning
Statistical Characteristic-Guided Denoising for Rapid High-Resolution Transmission Electron Microscopy Imaging
Look Before You Fuse: 2D-Guided Cross-Modal Alignment for Robust 3D Detection
It's Never Too Late: Noise Optimization for Collapse Recovery in Trained Diffusion Models
tttLRM: Test-Time Training for Long Context and Autoregressive 3D Reconstruction
Residual Primitive Fitting of 3D Shapes with SuperFrusta
Illustrator’s Depth: Monocular Layer Index Prediction for Image Decomposition
MMFace-DiT: A Dual-Stream Diffusion Transformer for High-Fidelity Multimodal Face Generation
GaussianVision: Vision-Language Alignment from Compressed Image Representations using 2D Gaussian Splatting
Unified Vector Floorplan Generation via Markup Representation
CREward: A Type-Specific Creativity Reward Model
Test-Time Alignment of Text-to-Image Diffusion Models via Null-Text Embedding Optimisation
Bidirectional Multimodal Prompt Learning with Scale-Aware Training for Few-Shot Multi-Class Anomaly Detection
SeeGroup: Multi-Layer Depth Estimation of Transparent Surfaces via Self-Determined Grouping
Pixels Don't Lie (But Your Detector Might): Bootstrapping MLLM-as-a-Judge for Trustworthy Deepfake Detection and Reasoning Supervision
E-RayZer: Self-supervised 3D Reconstruction as Spatial Visual Pre-training
WRIVINDER: Towards Spatial Intelligence for Geo-locating Ground Images onto Satellite Imagery
Learning Latent Proxies for Controllable Single-Image Relighting
TopoSlide: Topologically-Informed Histopathology Whole Slide Image Representation Learning
DRiffusion: Draft-and-Refine Process Parallelizes Diffusion Models with Ease
Personalized Image Descriptions from Attention Sequences
Specificity-aware reinforcement learning for fine-grained open-world classification
Pointer-CAD: Unifying B-Rep and Command Sequences via Pointer-based Edges & Faces Selection
Scaling Zero-Shot Reference-to-Video Generation
MS-Temba: Multi-Scale Temporal Mamba for Understanding Long Untrimmed Videos
Downscaling Intelligence: Exploring Perception and Reasoning Bottlenecks in Small Multimodal Models
ChangeBridge: Spatiotemporal Image Generation with Multimodal Controls for Remote Sensing
A Temporal and Content Co-Awareness Latent Diffusion for Controllable Hand Image Generation
ReasonEdit: Towards Reasoning-Enhanced Image Editing Models
PosterOmni: Generalized Artistic Poster Creation via Task Distillation and Unified Reward Feedback
4D-RGPT: Toward Region-level 4D Understanding via Perceptual Distillation
UltraFlux: Data-Model Co-Design for High-quality Native 4K Text-to-Image Generation across Diverse Aspect Ratios
Accelerating Diffusion Model Training under Minimal Budgets: A Condensation-Based Perspective
EditMGT: Unleashing Potentials of Masked Generative Transformers in Image Editing
Generalized-CVO: Fast and Correspondence-Free Local Point Cloud Registration with Second Order Riemannian Optimization
C-GenReg: Training-Free 3D Point Cloud Registration by Multi-View-Consistent Geometry-to-Image Generation with Probabilistic Modalities Fusion
FEAT: Fashion Editing and Try-On from Any Design
AVA-Bench: Atomic Visual Ability Benchmark for Vision Foundation Models
WEAVE: Unleashing and Benchmarking the In-context Interleaved Comprehension and Generation
How to Take a Memorable Picture? Empowering Users with Actionable Feedback
DeltaQuant: 4-bit Video Diffusion Models with Spatiotemporal Delta Smoothing
InfiniBench: Infinite Benchmarking for Visual Spatial Reasoning with Customizable Scene Complexity
ID-Sim: An Identity-Focused Similarity Metric
VQ-VA World: Towards High-Quality Visual Question-Visual Answering
Talking Together: Synthesizing Co-Located 3D Conversations from Audio
Hi-Lo Prune: Look at What You'll Lose before Pruning with Hierarchical Token Selection
NIL: No-data Imitation Learning
Unsupervised Multi-agent and Single-agent Perception from Cooperative Views
BiFM: Bidirectional Flow Matching for Few-Step Image Editing and Generation
LocateAnything3D: Vision-Language 3D Detection with Chain-of-Sight
Latent Implicit Visual Reasoning
Adapter Shield: A Unified Framework with Built-in Authentication for Preventing Unauthorized Zero-Shot Image-to-Image Generation
SAME: Sparse and Anchored Model Editing for Heterogeneous Incremental Learning under Limited Data
BAMI: Training-Free Bias Mitigation in GUI Grounding
SpotEdit: Selective Region Editing in Diffusion Transformers
Spatial-SAM: Spatially Consistent 3D Electron Microscopy Segmentation with SDF Memory and Semi-Supervised Learning
Unsupervised Multi-Scale Segmentation of 3D Subcellular World with Stable Diffusion Foundation Model
HorizonForge: Driving Scene Editing with Any Trajectories and Any Vehicles
VisualOverload: Probing Visual Understanding of VLMs in Really Dense Scenes
TTRV: Test-Time Reinforcement Learning for Vision Language Models
OctoNav: Towards Generalist Embodied Navigation
PercHead: Perceptual Head Model for Single-Image 3D Head Reconstruction & Editing
Fast-ThinkAct: Efficient Vision-Language-Action Reasoning via Verbalizable Latent Planning
A Mixed Diet Makes DINO An Omnivorous Vision Encoder
Relightful Video Portrait Harmonization
VideoNet: A Large-Scale Dataset for Domain-Specific Action Recognition
E-3DPSM: A State Machine for Event-based Egocentric 3D Human Pose Estimation
Archon: A Unified Multimodal Model for Holistic Digital Human Generation
Wave-Former: Through-Occlusion 3D Reconstruction via Wireless Shape Completion
Toward Generalizable Whole Brain Representations with High-Resolution Light-Sheet Data
MedCLIPSeg: Probabilistic Vision-Language Adaptation for Data-Efficient and Generalizable Medical Image Segmentation
Intervention-Aware Multiscale Representation Learning from Imaging Phenomics and Perturbation Transcriptomics
Color-Encoded Illumination for High-Speed Volumetric Scene Reconstruction
SEATrack: Simple, Efficient, and Adaptive Multimodal Tracker
From Scale to Speed: Adaptive Test-Time Scaling for Image Editing
StyleTextGen: Style-Conditioned Multilingual Scene Text Generation
Residual Decoder Adapter: ID-Preserving Tokenizer Adaption for Autoregressive Text Rendering
HumanBA: Human-Aware Bundle Adjustment via Global Human-Camera Decoupling
A Faster Path to Continual Learning
Gau-Occ: Geometry-Completed Gaussians for Multi-Modal 3D Occupancy Prediction
UniGame: Turning a Unified Multimodal Model Into Its Own Adversary
Multigrain-aware Semantic Prototype Scanning and Tri-Token Prompt Learning Embraced High-Order RWKV for Pan-Sharpening
Concept-Guided Fine-Tuning: Steering ViTs away from Spurious Correlations to Improve Robustness
Language-Free Generative Editing from One Visual Example
Query2Uncertainty: Robust Uncertainty Quantification and Calibration for 3D Object Detection under Distribution Shift
Anti-I2V: Safeguarding your Photos from Malicious Image-to-video Generation
Back to Basics: Let Denoising Generative Models Denoise
Improved Mean Flows: On the Challenges of Fastforward Generative Models
Bidirectional Normalizing Flow: From Data to Noise and Back
Enhancing Out-of-Distribution Detection with Extended Logit Normalization
Wanderland: Geometrically Grounded Simulation for Open-World Embodied AI
Thinking in 360°: Humanoid Visual Search in the Wild
Cloning Deterministic Worlds: The Critical Role of Latent Geometry in Long-Horizon World Models
Mamba Learns in Context: Structure-Aware Domain Generalization for Multi-Task Point Cloud Understanding
The Geometry of Robustness: Optimizing Loss Landscape Curvature and Feature Manifold Alignment for Robust Finetuning of Vision-Language Models
YOLO-Master: MOE-Accelerated with Specialized Transformers for Enhanced Real-time Detection
A Semantically Disentangled Unified Model for Multi-category 3D Anomaly Detection
Extend3D: Town-Scale 3D Generation
Beyond the Global Scores: Fine-Grained Token Grounding as a Robust Detector of LVLM Hallucinations
INSIGHT Bench: Towards Grounded IN-SItu Guidance for Robotic ManipulaTion
Language-driven Fine-grained Retrieval
StereoWorld: Geometry-Aware Monocular-to-Stereo Video Generation
SpatialTree: How Spatial Intelligence Branches Out in MLLMs
DreamStereo: Towards Real-Time Stereo Inpainting for HD Videos
ReAlign: Generalizable Image Forgery Detection via Reasoning-Aligned Representation
EgoXtreme: A Dataset for Robust Object Pose Estimation in Egocentric Views under Extreme Conditions
Sparse-LaViDa: Sparse Multimodal Discrete Diffusion Language Models
HBridge: H-Shape Bridging of Heterogeneous Experts for Unified Multimodal Understanding and Generation
Self-Evaluation Unlocks Any-Step Text-to-Image Generation
Transition Models: Rethinking the Generative Learning Objective
UniSER: A Foundation Model for Unified Soft Effects Removal
FusionAgent: A Multimodal Agent with Dynamic Model Selection for Human Recognition
Gaze Target Estimation Anywhere with Concepts
Exploring Adaptive Masked Reconstruction for Self-Supervised Skeleton-Based Action Recognition
PointNSP: Autoregressive 3D Point Cloud Generation with Next-Scale Level-of-Detail Prediction
Critical Patch-Aware Sparse Prompting with Decoupled Training for Continual Learning on the Edge
Visual-RRT: Finding Paths toward Visual-Goals via Differentiable Rendering
QuantVLA: Scale-Calibrated Post-Training Quantization for Vision-Language-Action Models
Same Attention, Different Truths: Put Logit-Lens over Visual Attention to Detect and Mitigate LVLM Object Hallucination
From Where Things Are to What They Are For: Benchmarking Spatial–Functional Intelligence in Multimodal LLMs
Prototype-as-Prompt: Multimodal Sentiment Prototypes Endowing Large Language Models the Capability to Perform Multimodal Sentiment Analysis
TrafficAlign: Aligning Large Language Models for Traffic Scenario Generation
Solvability of the Viewing Graph Under the Affine Camera Model
MindPower: Enabling Theory-of-Mind Reasoning in VLM-based Embodied Agents
Computational Speckle Pattern Interferometry
MERLIN: Building Low-SNR Robust Multimodal LLMs for Electromagnetic Signals
Attribution as Retrieval: Model-Agnostic AI-Generated Image Attribution
An Efficient Token Compression Framework for Visual Object Tracking
Property-Informed Diffusion-Based Text-to-Microstructure Generation
Physical Object Understanding with a Physically Controllable World Model
Content-Aware Dynamic Patchification for Efficient Video Diffusion
SPEAR-1: Scaling Beyond Robot Demonstrations via 3D Understanding
SenCache: Accelerating Diffusion Model Inference via Sensitivity-Aware Caching
Curriculum Group Policy Optimization: Adaptive Sampling for Unleashing the Potential of Text-to-Image Generation
Adaptive Confidence Regularization for Multimodal Failure Detection
DEVA: Fine-tuning Multimodal Large Language Models for Visual Perception Tasks
Memory-Efficient Fine-Tuning Diffusion Transformers via Dynamic Patch Sampling and Block Skipping
Beyond Objects: Contextual Synthetic Data Generation for Fine-Grained Classification
STAR-R1: Multi-View Spatial TrAnsformation Reasoning by Reinforcing Multimodal LLMs
DSCA: Dynamic Subspace Concept Alignment for Lifelong VLM Editing
iMontage: Unified, Versatile, Highly Dynamic Many-to-many Image Generation
DocPrune: Efficient Document Question Answering via Background, Question, and Comprehension-aware Token Pruning
NERFIFY: A Multi-Agent Framework for Turning NeRF Papers into Code
MoE-GRPO: Optimizing Mixture-of-Experts via Reinforcement Learning in Vision-Language Models
SpatialScore: Towards Comprehensive Evaluation for Spatial Intelligence
SpatialVID: A Large-Scale Video Dataset with Spatial Annotations
Make it SING: Analyzing Semantic Invariants in Classifiers
Pointing at Parts: Training-Free Few-Shot Grounding in Multimodal LLMs
Revisiting Model Stitching In the Foundation Model Era
Long-RVOS: A Comprehensive Benchmark for Long-term Referring Video Object Segmentation
Kaleidoscopic Scintillation Event Imaging
Beyond Global Similarity: Multi-Conditional Retrieval for Fine-Grained Cross-Modal Understanding
Cross-Modal Guided Visual Synthesis for Data-Efficient Multimodal Depression Recognition
LoD-Loc v3: Generalized Aerial Localization in Dense Cities using Instance Silhouette Alignment
Learning Multi-View Spatial Reasoning from Cross-View Relations
Learning from Oblivion: Predicting Knowledge-Overflowed Weights via Retrodiction of Forgetting
EDGS: Eliminating Densification for Efficient Convergence of 3DGS
Learnable Motion-Focused Tokenization for Effective and Efficient Video Unsupervised Domain Adaptation
NoRD: A Data-Efficient Vision-Language-Action Model that Drives without Reasoning
Exploring Visual Pretraining for Learning Language Intelligence
Drift-Resilient Temporal Priors for Visual Tracking
GUIDE: A Benchmark for Understanding and Assisting Users in Open-Ended GUI Tasks
BLMT-Stereo: Breaking the Local Minima Trap of Iterative Stereo Matching
SwiftNDC: Fast Neural Depth Correction for High-Fidelity 3D Reconstruction
ProDiG: Progressive Diffusion-Guided Gaussian Splatting for Aerial to Ground Reconstruction
GauSDF: Signed Distance Embedded Gaussian Surfels for 3D Reconstruction
4D E-SloMo: 4D Reconstruction for High Speed Scene using a Hybrid RGB-Event Multi-View System
MADrive: Memory-Augmented Driving Scene Modeling
OnlineX: Unified Online 3D Reconstruction and Understanding with Active-to-Stable State Evolution
Omni-Fake: Benchmarking Unified Multimodal Social Media Deepfake Detection
SyncTrack4D: Cross-Video Motion Alignment and Video Synchronization with Multi-Video 4D Gaussian Splatting
Learning to Assist: Physics-Grounded Human-Human Control via Multi-Agent Reinforcement Learning
AR4D: Autoregressive 4D Generation from Monocular Videos
CLLAP: Contrastive Learning-based LiDAR-Augmented Pretraining for Enhanced Radar-Camera Fusion
Point2Gaussian: Point-Cloud-to-Gaussian Conversion for Efficient 3D Scene Rendering
Speed3R: Sparse Feed-forward 3D Reconstruction Models
FF3R: Feedforward Feature 3D Reconstruction from Unconstrained views
Generalizable Human Gaussian Splatting via Multi-view Semantic Consistency
HEDA: Hyperbolic-Euclidean Dual Adaptation for Robust Real-World Point Cloud Completion
WildAni4D: Towards 4D Animal Mesh Reconstruction
Instant Colorization of Gaussian Splats
Dynamic Scene Decomposition Beyond Moving Objects for High-Fidelity 3D Reconstruction in Autonomous Driving
LangFlash: Feed-forward 3D Language Gaussian Splatting from Sparse Unposed Images
Prospective Dynamic 3D MRI Reconstruction via Latent-Space Motion Tracking from Single Measurement
FACT-GS: Frequency-Aligned Complexity-Aware Texture Reparameterization for 2D Gaussian Splatting
Affine Bases for Affine Spaces
DynamicTree: Interactive Real Tree Animation via Sparse Voxel Spectrum
Improving Densification in 3D Gaussian Splatting for High-Fidelity Rendering
OpenTrack3D: Towards Accurate and Generalizable Open-Vocabulary 3D Instance Segmentation
GeoFusion-CAD: Structure-Aware Diffusion with Geometric State Space for Parametric 3D Design
MMGait: Towards Multi-Modal Gait Recognition
SPIDER: Spatial Image CorresponDence Estimator for Robust Calibration
GEAR: GEometry-Motion Alternating Refinement for Articulated Object Modeling with Gaussian Splatting
GlowGS: Generative Semantic Feature Learning for 3D Gaussian Splatting in Nighttime Glow Scenes
FinPercep-RM: A Fine-grained Reward Model and Co-evolutionary Curriculum for RL-based Real-world Super-Resolution
Affordance-First Decomposition for Continual Learning in Video–Language Understanding
2D Triangle Splatting for Direct Differentiable Mesh Training
Beyond What's Shared: Recovering Lost Unique Information from Intermediate Layers to Boost Multimodal Geo-Foundation Models
ForgeDreamer: Industrial Text-to-3D Generation with Multi-Expert LoRA and Cross-View Hypergraph
DiffDecompose: Layer-Wise Decomposition of Alpha-Composited Images via Diffusion Transformers
3D Gaussian Splatting for Annular Dark Field Scanning Transmission Electron Microscopy Tomography Reconstruction
IDSplat: Instance-Decomposed 3D Gaussian Splatting for Driving Scenes
Nonlinear Color Transfer via Learnable Bezier Flows
Stream3D: Streaming Zero-Shot 3D Instance Segmentation with Multi-View Noise Mask Filtering and Manifold Refining
Active Exploration for Sparse Visual Localization
DisCa: Accelerating Video Diffusion Transformers with Distillation-Compatible Learnable Feature Caching
GRVS: a Generalizable and Recurrent Approach to Monocular Dynamic View Synthesis
Training One Model to Master Cross-Level Agentic Actions via Reinforcement Learning
3DFA: Aligning the Features Between Point Cloud and Query Image for Scene-Specific Visual Localization
GenErase: Generalizable and Semantically-Aware Concept Erasure in Diffusion Models
Long-LRM++: Preserving Fine Details in Feed-Forward Wide-Coverage Reconstruction
DABO: Difficulty-Aware Bayesian Optimization with Diffusion-Learned Priors
Native3D: End-to-End 3D Scene Generation via Unified Mesh-Texture Modeling and Semantic Alignment
From Orbit to Ground: Generative City Photogrammetry from Extreme Off-Nadir Satellite Images
CaliTex: Geometry-Calibrated Attention for View-Coherent 3D Texture Generation
NeVStereo: A NeRF-Driven NVS-Stereo Architecture for High-Fidelity 3D Tasks
VGA: Empowering Aerial-Ground Localization by Visual Geometry Alignment
VGGT4D: Mining Motion Cues in Visual Geometry Transformers for 4D Scene Reconstruction
AvatarMix: Identity-Preserving Cross-Avatar Composition for Outfit Personalization
Object Pose Transformer: Unifying Unseen Object Pose Estimation
SwiftVGGT: A Scalable Visual Geometry Grounded Transformer for Large-Scale Scenes
CATRF: Codec-Adaptive TriPlane Radiance Fields for Volumetric Content Delivery
VGGT-Det: Mining VGGT Internal Priors for Sensor-Geometry-Free Multi-View Indoor 3D Object Detection
PDF-GS: Progressive Distractor Filtering for Robust 3D Gaussian Splatting
Three-Step Conditional Diffusion 3D Reconstruction for Light-Field Microscopy
LTGS: Long-Term Gaussian Scene Chronology From Sparse View Updates
DASH: A Meta-Attack Framework for Synthesizing Effective and Stealthy Adversarial Examples
Learning a Particle Dynamics Model with Real-World Videos
VAR RL Done Right: Tackling Asynchronous Policy Conflicts in Visual Autoregressive Generation
Softmax-GS: Generalized Gaussians Learning When to Blend or Bound
Scaling Multi-Identity Consistency for Image Customization via Multi-to-Multi Matching Paradigm
G2I: Transitioning a Generalized Monocular Depth Estimation Model to In-Domain Metric Depth Prediction
TextOVSR: Text-Guided Real-World Opera Video Super-Resolution
Fast SceneScript: Fast and Accurate Language-Based 3D Scene Understanding via Multi-Token Prediction
3D-RE-GEN: 3D Reconstruction of Indoor Scenes with a Generative Framework
SignPR: A Progressive Vector-Quantized Diffusion Framework for Sign Language Production
HiDiGen: Hierarchical Diffusion for B-Rep Generation with Explicit Topological Constraints
From 3D Pose to Prose: Biomechanics-Grounded Vision–Language Coaching
RiGS: Rigid-aware 4D Gaussian Splatting from a Single Monocular Video
Spectral-Geometric Neural Fields for Pose-Free LiDAR View Synthesis
RAD: Retrieval-Augmented Monocular Metric Depth Estimation for Underrepresented Classes
PrivateEyes: Gaze-Preserving Anonymization for Data Sharing
FedAFD: Multimodal Federated Learning via Adversarial Fusion and Distillation
WGS: Watertight Geometry Standardization for Scalable 3D Generation
Self-Evolving 3D Scene Generation from a Single Image
Distill Any Depth: Distillation Creates a Stronger Monocular Depth Estimator
Re-Depth Anything: Test-Time Depth Refinement via Self-Supervised Re-lighting
HumanVBench: Probing Human-Centric Video Understanding in MLLMs with Automatically Synthesized Benchmarks
UnityVideo: Unified Multi-Modal Multi-Task Learning for Enhancing World-Aware Video Generation
UniVerse3D: Emerging Properties of Unified Multimodal Models in 3D Understanding and Generation
A Causal Marriage between VLM and IRM from Understanding to Reasoning
HumanOrbit: 3D Human Reconstruction as 360° Orbit Generation
Beyond Voxel 3D Editing: Learning from 3D Masks and Self-Constructed Data
GGBench: A Geometric Generative Reasoning Benchmark for Unified Multimodal Models
Adversarial Agents: Black-Box Evasion Attacks with Reinforcement Learning
OneThinker: All-in-one Reasoning Model for Image and Video
Defending CLIP via Noise-Induced Feature Dynamics for Training-Free, Zero-shot Adversarial Robustness
Jailbreaking Frontier Foundation Models Through Intention Deception
NSGuard: Null-Space Guided Robust Watermarking for Data Copyright Protection in Customized Generation
NoOVD: Novel Category Discovery and Embedding for Open-Vocabulary Object Detection
Chain of World: World Model Thinking in Latent Motion
A Robust Out-of-Distribution Detection Framework via Synergistic Smoothing
Language-Grounded Decoupled Action Representation for Robotic Manipulation
Phantasia: Context-Adaptive Backdoors in Vision Language Models
BadRSSD: Backdoor Attacks on Regularized Self-Supervised Diffusion Models
HandX: Scaling Bimanual Motion and Interaction Generation
CLIP-Inspector: Model-Level Backdoor Detection for Prompt-Tuned CLIP via OOD Trigger Inversion
BadVLM: Towards Efficient and Resilient Backdoor Attacks on Large Vision-Language Models
ProjFlow: Projection Sampling with Flow Matching for Zero-Shot Exact Spatial Motion Control
Finetune Like You Pretrain: Boosting Zero-shot Adversarial Robustness in Vision-language Models
PRUE: A Practical Recipe for Field Boundary Segmentation at Scale
Breaking the Illusion: Consensus-Based Generative Mitigation of Adversarial Illusions in Multi-Modal Embeddings
Egocentric Visibility-Aware Human Pose Estimation
The Surprising Effectiveness of Noise Pretraining for Implicit Neural Representations
When Data is Scarce, Learn to Adapt: Robust Federated Learning via Adversarial Meta-Optimization
Robust Alignment: Harmonizing Clean Accuracy and Adversarial Robustness in Adversarial Training
DRA: Structure-Preserving Backdoor Erasure via Diagnosing, Recalibrating, and Adapting
iSHIFT: Lightweight Slow-Fast GUI Agent with Adaptive Perception
TeHOR: Text-Guided 3D Human and Object Reconstruction with Textures
APC: Transferable and Efficient Adversarial Point Counterattack for Robust 3D Point Cloud Recognition
Cognitive Attack Detection in Augmented Reality (CADAR): A Neuro-Symbolic Approach with Particle Filtering on Perception Graphs
On Evaluating Stateful Defence Models against Query-Based Black-Box Attacks
Optimizing Certified Radius of Zero-shot Composed Image Retrieval via Text Guidance
When Interpretability Becomes a Liability: Adversarial Attacks on CBM Concept Layers
Red-teaming the Multimodal Reasoning: Jailbreaking Vision-Language Models via Cross-modal Entanglement Attacks
Phantom: A Unified Face-Swap Deepfake Protection Framework with Latent and Spatial Constraints
Tap, Scan, Exploit: The Hidden Vulnerabilities of Everyday QR Codes
DeepFakeShield: A Proactive Defense Against Malicious Face Swapping
MDG: Masked Denoising Generation for Multi-Agent Behavior Modeling in Traffic Environments
LiDAR-to-4D Radar Synthesis for Building Large-Scale Tensor Datasets
UnReflectAnything: RGB-Only Highlight Removal by Rendering Synthetic Specular Supervision
SurfaceGS: Dynamic Surface Gaussian Splatting for Urban Driving Scenes
GenieDrive: Towards Physics-Aware Driving World Model with 4D Occupancy Guided Video Generation
JACoP: Joint Alignment for Compliant Multi-Agent Prediction
Lifting Unlabeled Internet-level Data for 3D Scene Understanding
Fine-tuning is Not Enough: A Parallel Framework for Collaborative Imitation and Reinforcement Learning in End-to-end Autonomous Driving
Block-based Learned Image Compression without Blocking Artifacts
What Matters for Scalable and Robust Learning in End-to-End Driving Planners?
Physics-Informed Reward Framework for Vision-Language Driven Safe Autonomous Driving
HorizonWeaver: Generalizable Multi-Level Semantic Editing for Driving Scenes
VESPA: Open-World Auto-Labeling for 3D Object Detection in Autonomous Driving
Hear you are: Teaching LLMs Spatial Reasoning with Vision and Spatial Sound
IRL-VLA: Vision-Language-Action Training via Reward World Model
KnowMTP: A Knowledge-Guided Framework for Multi-Agent Trajectory Prediction in Autonomous Driving
MapGPT: A Vision-Language Model for Large-Scale High-Definition Map Generation
CryoHype: Reconstructing a thousand cryo-EM structures with transformer-based hypernetworks
RoaD: Rollouts as Demonstrations for Closed-Loop Supervised Fine-Tuning of Autonomous Driving Policies
P-Flow: Prompting Visual Effects Generation
3D Gaussian Splatting from Unposed Spike Stream
PAVE: An End-to-End Dataset for Production Autonomous Vehicle Evaluation
RoadTones: Tone Controllable Text Generation from Road Event Videos
CLIP-like Model as a Foundational Density Ratio Estimator
GRADE: Guiding Realistic Autonomous Driving with Adaptive Trajectory Evolution
SurfelOcc: Self-supervised Occupancy Prediction via 2D Surfel Splatting
dVLM-AD: Enhance Diffusion Vision-Language-Model for Driving via Controllable Reasoning
CompBench: Benchmarking Complex Instruction-guided Image Editing
Choreographing a World of Dynamic Objects
Devil is in Narrow Policy: Unleashing Exploration in Driving VLA Models
GT-SVJ: Generative-Transformer-Based Self-Supervised Video Judge For Efficient Video Reward Modeling
Learning Vision-Language-Action World Models for Autonomous Driving
HybridDriveVLA: Vision-Language-Action Model with Visual CoT reasoning and ToT Evaluation for Autonomous Driving
Training-free, Perceptually Consistent Low-Resolution Previews with High-Resolution Image for Efficient Workflows of Diffusion Models
AD-R1: Closed-Loop Reinforcement Learning for End-to-End Autonomous Driving with Impartial World Models
Quantized Residuals to Continuous Prompts for Few-Shot Class Incremental Learning in Vision-Language Models
EcoSplat: Efficiency-controllable Feed-forward 3D Gaussian Splatting from Multi-view Images
Pseudo-Expert Regularized Offline RL for End-to-End Autonomous Driving in Photorealistic Closed-Loop Environments
AD-GBC: Anisotropic Granular-Ball Skip-Connection Refiner for UNet-Based Medical Image Segmentation
White-Balance First, Adjust Later: Cross-Camera Color Constancy via Vision-Language Evaluation
OmniDrive-R1: Reinforcement-driven Interleaved Multi-modal Chain-of-Thought for Trustworthy Vision-Language Autonomous Driving
Reallocating Attention Across Layers to Reduce Multimodal Hallucination
CoRT-Predictor: Chain of Risk Thought Autoregressive Trajectory Predictor for Autonomous Driving
REArtGS++: Generalizable Articulation Reconstruction with Temporal Geometry Constraint via Planar Gaussian Splatting
TALON: Test-time Adaptive Learning for On-the-Fly Category Discovery
C^2T: Captioning-Structure and LLM-Aligned Common-Sense Reward Learning for Traffic–Vehicle Coordination
RetouchIQ: MLLM Agents for Instruction-Based Image Retouching with Generalist Reward
PEARL: A Lightweight Prompt-based Feature Interpreter Framework for Real-Time, Anonymous, and Heterogeneous Collaborative Perception
DreamSR: Towards Ultra-High-Resolution Image Super-Resolution via a Receptive-Field Enhanced Diffusion Transformer
Variable-View Diffusion with Geometric Uncertainty Unlocks LiDAR Upsampling
LongStream: Long-Sequence Streaming Autoregressive Visual Geometry
VisRef: Visual Refocusing while Thinking Improves Test-Time Scaling in Multi-Modal Large Reasoning Models
RQR3D: Reparametrizing the regression targets for BEV-based 3D object detection
Pluggable Pruning with Contiguous Layer Distillation for Diffusion Transformers
On the Feasibility and Opportunity of Autoregressive 3D Object Detection
Boundary-Responsive Differentiable Gating for Superpixel-Based Segmentation
See Tomorrow, Act Today: Foresight-Driven Autonomous Driving
Rethinking Asymmetric Quantization: Hidden Symmetry in Vision Model Weights
Hierarchical Enhancement of Semantic Priors for Disentangled Text-Driven Motion Generation
Spatial Transcriptomics as Images for Large-Scale Pretraining
MovieRecapsQA: A Multimodal Open-Ended Video Question-Answering Benchmark
DiffGradCAM: A Class Activation Map Using the Full Model Decision to Solve Unaddressed Adversarial Attacks
Scalable Object Relation Encoding for Better 3D Spatial Reasoning in Large Language Models
Physical Simulator In-the-Loop Video Generation
Modality-Aware and Anatomical Vector-Quantized Autoencoding for Multimodal Brain MRI
ConceptPose: Training-Free Zero-Shot Object Pose Estimation using Concept Vectors
Fingerprint Fragment Expansion using Image Outpainting Approach based on Spectral Normalization PatchGAN
JoPPO: Hierarchical Photography Assessment via Contrastive Joint Conditional Probabilistic Reinforcement Learning
Improving Autoregressive Image Generation Through Coarse-to-Fine Token Prediction
Fine-VAD: Towards Fine-Grained Video Anomaly Detection via Progressive Cross-Granularity Learning
Functional Mean Flow in Hilbert Space
Intelligent Photo Retouching with Language Model-Based Artist Agents
CausalVAD: De-confounding End-to-End Autonomous Driving via Causal Intervention
Guided Lensless Polarization Imaging
One Patch to Caption Them All: A Unified Zero-Shot Captioning Framework
Towards Robust Multi-Modal Semantic Segmentation with Teacher-Student Framework and Hybrid Prototype Distillation
Blockwise Divide-and-Aggregate for Image Restoration using Diffusion Priors
SPEGC: Continual Test-Time Adaptation via Semantic-Prompt-Enhanced Graph Clustering for Medical Image Segmentation
TF-SSD: A Strong Pipeline via Synergic Mask Filter for Training-free Co-salient Object Detection
ExtrinSplat: Decoupling Geometry and Semantics for Open-Vocabulary Understanding in 3D Gaussian Splatting
Vocabulary Scaling Law: Tuning Open-vocabulary Predictors for Their Openness
Towards Imperceptible Watermarking Via Environment Illumination for Consumer Cameras
The Consistency Critic: Correcting Inconsistencies in Generated Images via Reference-Guided Attentive Alignment
Adaptive Continuous Kernel Networks for Image Reconstruction from Non-Uniform Sampling
StyleDoctor: Towards Specialist Reward Model for Style-centric Generation Tasks
FreqAdapt: Frequency-Adaptive Processing for RAW Object Detection
Stability and Non-Local Modeling in Hybrid Convolution–Transformer Networks for Snapshot Hyperspectral Reconstruction
Where Does Vision Meet Language? Understanding and Refining Visual Fusion in MLLMs via Contrastive Attention
Breaking Degradation Coupling: A Structural Entropy–Guided Decoupled Framework and Benchmark for Infrared Enhancement
From Events to Clarity: The Event-Guided Diffusion Framework for Dehazing
Fast Generative DeOcclusion for Visual Geometry and Robotics
A Frame is Worth One Token: Efficient Generative World Modeling with Delta Tokens
Harmony: Harmonizing Audio and Video Generation through Cross-Task Synergy
FineCog-Nav: Integrating Fine-grained Cognitive Modules for Zero-shot Multimodal UAV Navigation
MAMMA: Markerless Accurate Multi-person Motion Acquisition
High-Quality and Efficient Turbulence Mitigation with Events
Unlocking Single-View Constraints for Efficient Camera Relocalization with Keypoint-Level Multi-View Geometric Consistency in Training
HeroGS: Hierarchical Guidance for Robust 3D Gaussian Splatting under Sparse Views
Evolve Vision-Language-Action Model into an Agent with On-the-fly Tool-use
Bidirectional Query-Driven Generation of Parametric CAD Sketch
Retrieval-VLA: Training-Free In-Context Adaptation for Vision-Language-Action Models
SRGCD: Stability-Driven Region Growth Framework for 3D Change Detection
Revisiting Articulated Parts Perception in Robot Manipulation
SHARP: Short-Window Streaming for Accurate and Robust Prediction in Motion Forecasting
Re^2MoGen: Open-Vocabulary Motion Generation via LLM Reasoning and Physics-Aware Refinement
Dual-Estimator: Decoupling Global and Local Semantic Shift for Drift Compensation in Class-Incremental Learning
SAM 3D Body: Robust Full-Body Human Mesh Recovery
Teleoperation, Simulation, or Human Video? Data Utilization Law for Robot Manipulation
Boosting Quantitive and Spatial Awareness for Zero-Shot Object Counting
ShelfGaussian: Shelf-Supervised Open-Vocabulary Gaussian-Based 3D Scene Understanding
World in a Frame: Understanding Culture Mixing as a New Challenge for Vision-Language Models
OpenVoxel: Training-Free Grouping and Captioning Voxels for Open-Vocabulary 3D Scene Understanding
RoboTransfer: Controllable Geometry-Consistent Video Diffusion for Manipulation Policy Transfer
Multi-Modal Image Fusion via Intervention-Stable Feature Learning
Explore with Long-term Memory: A Benchmark and Multimodal LLM-based Reinforcement Learning Framework for Embodied Exploration
Gastric-X: A Multimodal Multi-Phase Benchmark Dataset for Advancing Vision-Language Models in Gastric Cancer Analysis
Switch-JustDance: Benchmarking Whole-Body Motion Tracking Controllers Using a Commercial Console Game
AVION: Aerial Vision–Language Instruction from Offline Teacher to Prompt-Tuned Network
Efficient and Training-Free Single-Image Diffusion Models
OSA: Echocardiography Video Segmentation via Orthogonalized State Update and Anatomical Prior-aware Feature Enhancement
OminiMAG-SLAM: Unified Online Dual Graph Optimization for Multi-Agent Gaussian SLAM
VLM-Guided Group Preference Alignment for Diffusion-based Human Mesh Recovery
PhysIR-Splat: Physically Consistent Thermal Infrared Radiative Transfer in 3D Gaussian Splatting
ReaAct: Bridging Robotic Reasoning and Action Generation Toward Real-World Spatial Generalization
Learning Multi-Task Robot Trajectory Segmentation from Visual and Kinematic Streams
LIBERO-Plus: A Progressive Robustness Benchmark for Visual-Language-Action Models
AnyLift: Scaling Motion Reconstruction from Internet Videos via 2D Diffusion
LP3: LLM-based Potential Prediction Policy for Object Navigation using a Scene-Object Semantic Map
Accelerating Diffusion via Hybrid Data-Pipeline Parallelism Based on Conditional Guidance Scheduling
RoboScape-R: Unified Reward-Observation World Models for Generalizable Robotics Training via RL
UniVerse: Empower Unified Generation with Reasoning and Knowledge
CoTFly: Making UAVs Think Where to Fly Next Through Visual Chain-of-Thought Reasoning
Gated KalmaNet: A Fading Memory Layer through Test-time Ridge Regression
RU4D-SLAM: Reweighting Uncertainty in Gaussian Splatting SLAM for 4D Scene Reconstruction
AeroAgent: A Vision–Physics–Decision Framework for Aerodynamic Vehicle Design
Reviving ConvNeXt for Efficient Convolutional Diffusion Models
A1: Adaptive Truncated Vision-Language-Action Model from Affordance to Action
Rethinking Token Reduction for Large Vision-Language Models
Learning from Semantic Dictionaries: Discriminative Codebook Contrastive Learning for Unified Visual Representation and Generation
Ego-Pi: VLA Fine-Tuning for Ego-Centric Human and Robot Data
RACE-6D: Real-time Accurate Coarse-to-finE Object 6D Pose Transformer
ZipMap: Linear-Time Stateful 3D Reconstruction via Test-Time Training
From Observation to Action: Latent Action-based Primitive Segmentation for VLA Pre-training in Industrial Settings
SGDrive: Scene-to-Goal Hierarchical World Cognition for Autonomous Driving
MiVLA: Towards Generalizable Vision-Language-Action Model with Human-Robot Mutual Imitation Pre-training
UniGeoSeg: Towards Unified Open-World Segmentation for Geospatial Scenes
ShiftLUT: Spatial Shift Enhanced Look-Up Tables for Efficient Image Restoration
Riemannian Score-Based Diffusion for Language-Conditioned Grasp and Affordance Detection
DINO-VO: Learning Where to Focus for Enhanced State Estimation
VIRST: Video-Instructed Reasoning Assistant for SpatioTemporal Segmentation
Plug-and-Play Incomplete Multi-View Clustering via Janus-Faced Affinity Learning with Topology Harmonization
HoneyBee: Data Recipes for Vision-Language Reasoners
Temporally-Smooth Global Bundle Adjustment for Real-Time Dense Visual SLAM
RI-Mamba: Rotation-Invariant Mamba for Robust Text-to-Shape Retrieval
Masked Next-Scale Prediction For Self-Supervised Scene Text Recognition
PointThinker: Point-Incentivized Parallel Thinking for Multimodal Large Language Model
iTCTSL: Interpretable Tropical Cyclone Track and Intensity Forecasting via Task Sensitive Learning
CLAY: Conditional Visual Similarity Modulation in Vision-Language Embedding Space
TokenHand: Discrete Token Representation for Efficient Hand Mesh Reconstruction
Face2Scene: Using Facial Degradation as an Oracle for Diffusion-Based Scene Restoration
Machine Vision-Oriented Appearance Design: Generate Natural And Robust Textures For 3D Meshes
Balanced Hierarchical Contrastive Learning with Decoupled Queries for Fine-grained Object Detection in Remote Sensing Images
Fusion in Your Way: Aligning Image Fusion with Heterogeneous Demands via Direct Preference Optimization
Unlocking Token Rewards via Training-Free Reward Attribution
Bridge Your Fields: MeteoNet for Efficient Non-Uniform Meteorological Field Reconstruction
Anchoring and Rescaling Attention for Semantically Coherent Inbetweening
Accelerating Autoregressive Video Diffusion via History-Guided Cache and Residual Correction
Catalyst: Out-of-Distribution Detection via Elastic Scaling
Act Like a Pathologist: Tissue-Aware Whole Slide Image Reasoning
LUMINA: Learning and Understanding of Multimodal Information for Narrative and Affect-based Virality Prediction
Frequency-domain Manipulation for Face Obfuscation
Personalized Federated Training of Diffusion Models with Privacy Guarantees
LOOPE: Learnable Optimal Patch Order for Positional Encoders in Vision Transformers
Learning What Helps: Task-Aligned Context Selection for Vision Tasks
TextPecker: Rewarding Structural Anomaly Quantification for Enhancing Visual Text Rendering
Watermarking Matters for Deepfake Detection: A Proactive Method for Detecting Forgeries under Conventional Attacks
Breaking the Regional Perception Bottleneck of Multimodal Large Language Models via External Reasoning Framework
Scaling Spatial Intelligence with Multimodal Foundation Models
CTFS: Collaborative Teacher Framework for Forward-Looking Sonar Image Semantic Segmentation with Extremely Limited Labels
Divide, then Ground: Adapting Frame Selection to Query Types for Long-Form Video Understanding
Echoes Over Time: Unlocking Length Generalization in Video-to-Audio Generation Models
TPTransformer: Tensor–Tensor Product Transformer for Hyperspectral Image Super-Resolution
Semantic Audio-Visual Navigation in Continuous Environments
NeuroRule: Bridging Vision and Logic with Differentiable Rule Induction
Co-Adaptive Graph Learning Through Coupled Spectral Refinement for 3D Anomaly Detection
Training High-Level Schedulers with Execution-Feedback Reinforcement Learning for Long-Horizon GUI Automation
The Mechanics of CNN Filtering with Rectification
Towards Motion Turing Test: Evaluating Human-Likeness in Humanoid Robots
AndroidLong: LLM-based Android Agents Struggle with Long Looping Tasks
Harnessing Chain-of-Thought Reasoning in Multimodal Large Language Models for Face Anti-Spoofing
DiT360: High-Fidelity Panoramic Image Generation via Hybrid Training
Multimodal Large Language Models as Image Classifiers
HTTM: Head-wise Temporal Token Merging for Faster VGGT
Basis-Oriented Low-rank Transfer for Few-Shot and Test-Time Adaptation
Beyond Recognition: Evaluating Visual Perspective Taking in Vision Language Models
GauMVC: Generative Decoupled Gaussian Representation for Human-centric Multi-view Video Compression
Learning from Synthetic Data via Provenance-Based Input Gradient Guidance
InvAD: Inversion-based Reconstruction-Free Anomaly Detection with Diffusion Models
Exploring the best way for UAV visual localization under Low-altitude Multi-view Observation Condition: a Benchmark
ArtHOI: Taming Foundation Models for Monocular 4D Reconstruction of Hand-Articulated-Object Interactions
Benchmarking Layout-Guided Diffusion Models through Unified Semantic-Spatial Evaluation in Closed and Open Settings
Rethinking Occlusion Modeling for UAV Tracking
DA-VAE: Plug-in Latent Compression for Diffusion via Detail Alignment
Exploring Spatial Intelligence from a Generative Perspective
RefDrone: A Challenging Benchmark for Referring Expression Comprehension in Drone Scenes
Abstract 3D Perception for Spatial Intelligence in Vision-Language Models
CLP: A Real-World Dataset of Contaminated Lens Protectors for Robust Semantic Segmentation
InstAP: Instance-Aware Vision-Language Pre-Train for Spatial-Temporal Understanding
Beyond Appearance: Camouflaged Object Detection via Geometric Structure
Hidden Monotonicity: Explaining Deep Neural Networks via their DC Decomposition
Vision Language Models are Confused Tourists
LenghuSky-8: An 8-Year All-Sky Cloud Dataset with Star-Aware Masks and Alt-Az Calibration for Segmentation and Nowcasting
WaDi: Weight Direction-aware Distillation for One-step Image Synthesis
Through the PRISM: Principle-Aware, Interpretable, and Multi-Scale Evaluation of Visual Designs
Polarization State Tracing for Reflection Removal and Color-Consistent Reconstruction
AEGIS: Exploring the Limit of World Knowledge Capabilities for Unified Multimodal Models
GenHOI: Towards Object-Consistent Hand–Object Interaction with Temporally Balanced and Spatially Selective Object Injection
VIAFormer: Voxel-Image Alignment Transformer for High-Fidelity Voxel Refinement
Name That Part: 3D Part Segmentation and Naming
MV-TAP: Tracking Any Point in Multi-View Videos
Do MLLMs Exhibit Human-like Perceptual Behaviors? HVSBench: A Benchmark for MLLM Alignment with Human Perceptual Behavior
PPISP: Physically-Plausible Compensation and Control of Photometric Variations in Radiance Field Reconstruction
End-to-End Language-Action Model for Humanoid Whole Body Control
Memorization in 3D Shape Generation: An Empirical Study
Delta Rectified Flow Sampling for Text-to-Image Editing
Shape and Texture Recognition in Large Vision-Language Models
EgoProx: Evaluating MLLMs on Egocentric 3D Proximity Reasoning Across a Cognitive Hierarchy
Reliable Policy Transfer for Safety-Aware End-to-End Driving with Deep Reinforcement Learning
Beyond the Ground Truth: Enhanced Supervision for Image Restoration
U-SEG: Uncertainty in SEGmentation - A systematic multi-variable exploration
HyCal: A Training-Free Prototype Calibration Method for Cross-Discipline Few-Shot Class-Incremental Learning
UDVideoQA: A Traffic Video Question Answering Dataset for Multi-Object Spatio-Temporal Reasoning in Urban Dynamics
ForeHOI: Feed-forward 3D Object Reconstruction from Daily Hand-Object Interaction Videos
GOVTrack: Towards Generative Open-Vocabulary Multi-Object Tracking
BEV-SLD: Self-Supervised Scene Landmark Detection for Global Localization with LiDAR Bird’s-Eye View Images
Learning Generalizable 3D Medical Image Representations from Mask-Guided Self-Supervision
ArchSym: Detecting 3D-Grounded Architectural Symmetries in the Wild
Forensic-Friendly Image Manipulation via Controllable Latent Diffusion
ResCa: Residual Caching for Diffusion Transformers Acceleration
Towards Text-Guided Attribute-Disentangled Multimodal Representation Learning
RHO: Robust Holistic OSM-Based Metric Cross-View Geo-Localization
The DeepSpeak Dataset
Toward Diffusible High-Dimensional Latent Spaces: A Frequency Perspective
Suppressing Non-Semantic Noise in Masked Image Modeling Representations
AndroidLens: Long-latency Evaluation with Nested Sub-targets for Android GUI Agents
Beyond Matching to Tiles: Bridging Unaligned Aerial and Satellite Views for Vision-Only UAV Navigation
A2Z-10M+: Geometric Deep Learning with A-to-Z BRep Annotations for AI-Assisted CAD Modeling and Reverse Engineering
PRISM: Prototype-based Reasoning with Inter-modal Semantic Mining for Interpretable Image Recognition
MolRecBench-Wild: A Real-World Benchmark for Optical Chemical Structure Recognition
Event Structural Valley: A Unified Theoretical and Practical Framework for Event Camera Autofocus
JRM: Joint Reconstruction Model for Multiple Objects without Alignment
Collaborative Multi-Mode Pruning for Vision-Language Models
FinChart-Multimodal: A Dataset for Context-Injected Financial Chart Understanding with Aligned OHLCV Time Series
EG-3DVG: Expression and Geometry Aware Grounding Decoder for 3D Visual Grounding
THEval. Evaluation Framework for Talking Head Video Generation
Synergistic Bleeding Region and Point Detection in Laparoscopic Surgical Videos
PolyReal: A Benchmark for Real-World Polymer Science Workflows
CICA: Coupling Confidence-Aware Pretraining with Confidence-Informed Attention for Robust Multimodal Sentiment Analysis
OutSafe-Bench: A Benchmark for Multimodal Offensive Content Detection in Large Language Models
LiDAR Prompted Spatio-Temporal Multi-View Stereo for Autonomous Driving
PureSpace: A Benchmark for Abstract Spatial Reasoning in Vision-Language Models
Zero-Shot Image Denoising via Hybrid Prior-Guided Pseudo Sample Generation
Native Visual Understanding: Resolving Resolution Dilemmas in Vision-Language Models
Beyond 3D Geometry: M3FD, a Large-Scale Dataset and Benchmark for Multimodal 3D Perceptual Understanding
CaST-Bench: Benchmarking Causal Chain-Grounded Spatio-Temporal Reasoning for Video Question Answering
CoIn: Coverage and Informativeness-Guided Token Reduction for Efficient Large Multimodal Models
Paper2SysArch: Structure-Constrained System Architecture Generation from Scientific Papers
EVA: Efficient Reinforcement Learning for End-to-End Video Agent
ADSeeker: A Knowledge-Grounded Reasoning Framework for Industry Anomaly Detection and Reasoning
WildRelight: A Real-World Dataset and Benchmark for Single-Image Relighting
EgoTL: Egocentric Think-Aloud Chains for Long-Horizon Tasks
FlashVSR: Towards Real-time Diffusion-Based Streaming Video Super Resolution
When Local Rules Create Global Order: Self-Organized Representation Learning for Latent Diffusion Models
MeshFlow: Efficient Artistic Mesh Generation via MeshVAE and Flow-based Diffusion Transformer
VibraVerse: A Large-Scale Geometry–Acoustics Alignment Dataset for Physically-Consistent Multimodal Learning
A Unified Framework for Knowledge Transfer in Bidirectional Model Scaling
CineScene: Implicit 3D as Effective Scene Representation for Cinematic Video Generation
When Harmful Content Goes Invisible: Unveiling Perception Failure of LVLMs with CAMOUHARMTI
Humanoid Generative Pre-Training for Zero-Shot Motion Tracking
Step-CoT: Stepwise Visual Chain-of-Thought for Medical Visual Question Answering
SenseSearch: Empowering Vision-Language Models with High-Resolution Agentic Search-Reasoning via Reinforcement Learning
Real-IAD MVN: A Multi-View Normal Vector Dataset and Benchmark for High-Fidelity Industrial Anomaly Detection
SLARM: Streaming and Language-Aligned Reconstruction Model for Dynamic Scenes
Stochastic Ray Tracing for the Reconstruction of 3D Gaussian Splatting
The Unwritten Benchmark: A New Challenge for Multimodal Machine Learning in Abstract Perceptual Reasoning
Unified Camera Positional Encoding for Controlled Video Generation
Refining Few-Step Text-to-Multiview Diffusion via Reinforcement Learning
Unifying Scientific Communication: Fine-Grained Correspondence Across Scientific Media
FUN REC: Reconstructing Functional 3D Scenes from Egocentric Interaction Videos
MathAll: A Real-World Benchmark for Mathematical Reasoning and Cross-Modal Understanding Evaluation in Omni-MLLMs
Driving on Registers
Safe-LLaVA: A Privacy-Preserving Vision Language Dataset and Benchmark for Biometric Safety
Twin-T & TwintVQA: A Reliable Structure–Detail Separating VLM and a Comprehensive Benchmark for Chart and Table Tasks
DR-DPO: Dual-Regularized DPO for Efficient Dataset Condensation
NanoSD: Edge Efficient Foundation Model for Real Time Image Restoration
DrawingVQA: A Real-World Benchmark for Multi-Depth Visual–Textual Reasoning on Construction Drawings
Teacher-Guided Routing for Sparse Vision Mixture-of-Experts
SciPostGen: Bridging the Gap between Scientific Papers and Poster Layouts
Parameter-Efficient Semantic Augmentation for Enhancing Open-Vocabulary Object Detection
Decoupled Generative Modeling for Human-Object Interaction Synthesis
SpatialReward: Verifiable Spatial Reward Modeling for Fine-Grained Spatial Consistency in Text-to-Image Generation
Some Modalities are More Equal Than Others: Decoding and Architecting Multimodal Integration in MLLMs
From None to All: Self-Supervised 3D Reconstruction via Novel View Synthesis
Towards Reliable Human Evaluations in Gesture Generation: Insights from a Community-Driven State-of-the-Art Benchmark
Tokenization Allows Multimodal Large Language Models to Understand, Generate and Edit Architectural Floor Plans
RetimeGS: Continuous-Time Reconstruction of 4D Gaussian Splatting
SuperGlasses: Benchmarking Vision Language Models as Intelligent Agents for AI Smart Glasses
PHASE-Net: Physics-Grounded Harmonic Attention System for Efficient Remote Photoplethysmography Measurement
PLACID: Identity-Preserving Multi-Object Compositing via Video Diffusion with Synthetic Trajectories
InEdit-Bench: Benchmarking Intermediate Logical Pathways for Intelligent Image Editing Models
TerraScope: Pixel-Grounded Visual Reasoning for Earth Observation
VEBench: Benchmarking Large Multimodal Models for Real-world Video Editing
RAVEN: Erasing Invisible Watermarks via Novel View Synthesis
Deconstructing the Failure of Ideal Noise Correction: A Three-Pillar Diagnosis
CrowdVerse: A Bidirectional Reality-Calibrated Benchmark for Crowd Understanding and Simulation
Temporal Interaction in Spiking Transformers with Multi-Delay Mixer
Can Language Models Understand mmWave Data? Benchmarking Large Language Models for mmWave Radar-Based Human Understanding
Sketch2Colab: Sketch-Conditioned Multi-Human Animation via Controllable Flow Distillation
Attribute-Preserving Pseudo-Labeling for Diffusion-Based Face Swapping
Efficient Equivariant Transformer for Self-Driving Agent Modeling
From Static Snapshots to Dynamic Trajectories: Evaluating and Enhancing the Learning Pathways of Multimodal Large Language Models
Chorus: Multi-Teacher Pretraining for Holistic 3D Gaussian Scene Encoding
Evaluating Dataset Watermarking for Fine-Tuning Traceability of Customized Diffusion Models: A Comprehensive Benchmark and Removal Approach
Rewis3d: Reconstruction Improves Weakly-Supervised Semantic Segmentation
BMD-45: A Large-Scale CCTV Vehicle Detection Dataset for Urban Traffic in Developing Cities
Keep it SymPL: Symbolic Projective Layout for Allocentric Spatial Reasoning in Vision-Language Models
QueryMe: Query-Driven Open-Vocabulary 3D Object Affordances Grounding from Multimodal Evidence
SciGA: A Comprehensive Dataset for Designing Graphical Abstracts in Academic Papers
DREAM: Document Recognition with Explicit Adaptive Memory
Splatwizard: A Benchmark Toolkit for 3D Gaussian Splatting Compression
Geometric Neural Distance Fields for Learning Human Motion Priors
See, Hear, and Understand: Benchmarking Audiovisual Human Speech Understanding in Multimodal Large Language Models
FARMER: Flow AutoRegressive Transformer over Pixels
CCCaption: Dual-Reward Reinforcement Learning for Complete and Correct Image Captioning
Unlocking ImageNet’s Multi-Object Nature: Automated Large-Scale Multilabel Annotation
DENALI: A Dataset Enabling Non-Line-of-Sight Spatial Reasoning with Low-Cost LiDARs
CATS-V2V: A Real-World Vehicle-to-Vehicle Cooperative Perception Dataset with Complex Adverse Traffic Scenarios
Neural Dynamic GI: Random-Access Neural Compression for Temporal Lightmaps in Dynamic Lighting Environments
Seeing the Abstract: A Benchmark for Visual-Only Metaphor Understanding in Multimodal Large Language Models
PaQ-DETR: Learning Pattern and Quality-Aware Dynamic Queries for Object Detection
Cross-Dimensional Forgery Pattern Extraction for Generalizable Forgery Localization Framework
Cross-Modal Emotion Transfer for Emotion Editing in Talking Face Video
Reliable Test-time Adaptation Via Evidential Uncertainty Modeling in Vision–Language Models
Seeing What Matters: Visual Preference Policy Optimization for Visual Generation
SPOT: Sparsification with Attention Dynamics via Token Relevance in Vision Transformers
OnlineHMR: Video-based Online World-Grounded Human Mesh Recovery
Do LLMs and VLMs Share Reasoning Neurons? Evidence and Mechanisms of Cross-Modal Transfer
MODIX: A Training-Free Multimodal Information-Driven Positional Index Scaling for Vision-Language Models
HamiPose: Hamiltonian Optimization for Unsupervised Domain Adaptive Pose Estimation
Debiased One-Shot NAS Via Density-Aware Sampling
CGU-Bayes: Causal Graph Uncertainty-Guided Bayesian Inference for Domain Generalization
PSLIF: A Primary-Supplementary LIF Neuron for Spiking Neural Networks
Activation Matters: Test-time Activated Negative Labels for OOD Detection with Vision-Language Models
3D-Aware Multi-Task Learning with Cross-View Correlations for Dense Scene Understanding
SHIELD: Secure Hypernetworks for Incremental Expansion Learning Defense
CoIn3D: Revisiting Configuration-Invariant Multi-Camera 3D Object Detection
Eigen-Value: Efficient Domain-Robust Data Valuation Via Eigenvalue-Based Approach
Repurposing 3D Generative Model for Autoregressive Layout Generation
In2CLR: Joint Intra-Inter Curriculum Learning with Review for Degraded Fake Image Detection
RnG: A Unified Transformer for Complete 3D Modeling from Partial Observations
HAMSA: Scanning-Free Vision State Space Models via SpectralPulseNet
Global Information Thresholding for Sufficient and Necessary Circuits
Latent Domain Modeling Improves Robustness to Geographic Shifts
M3Grounder: Mask-Based Multi-Span and Multi-Granular Grounding for Document QA
Any-Class Presence Likelihood for Robust Multi-Label Classification with Abundant Negative Data
Towards Persistence: Learning Topological Constraints for Event-based Small Object Detection
GeoFree-CoSeg: Unsupervised Point Cloud-Image Cross-Modal Co-Segmentation Without Geometric Alignment
VideoMatGen: PBR Materials through Joint Generative Modeling
Any Resolution Any Geometry: From Multi-View To Multi-Patch
TransKV: A Data-Driven Pruning Method for Large Foundation Models
Ultrasound-CLIP: Semantic-Aware Contrastive Pre-training for Ultrasound Image-Text Understanding
Rich Feature Learning via Diversification
MultiCrafter: High-Fidelity Multi-Subject Generation via Disentangled Attention and Identity-Aware Preference Alignment
Benchmarking PhD-Level Coding in 3D Geometric Computer Vision
Image Classification Using CNN-QNN Hybrid Model with Optimized Correlated Features
SocialNav: Training Human-Inspired Foundation Model for Socially-Aware Embodied Navigation
SpiderCam: Low-Power Snapshot Depth from Differential Defocus
Cross-Scale Pansharpening via ScaleFormer and the PanScale Benchmark
Dual Strategies for Test-Time Adaptation
Denoising, Fast and Slow: Difficulty-Aware Adaptive Sampling for Image Generation
VGent: Visual Grounding via Modular Design for Disentangling Reasoning and Prediction
CPUBone: Efficient Vision Backbone Design for Devices with Low Parallelization Capabilities
Leveraging Multispectral Sensors for Color Correction in Mobile Cameras
LiveGesture: Streamable Co-Speech Gesture Generation Model
FLToM: Robust Federated Learning with Theory-of-Mind Structure
SMAP: Semantic Route Planning with Map-Grounded Multimodal Alignment
FedCVC: Federated Primal-Dual Learning with Client-Driven Virtual Compensation for Mitigating Dual Drift
Affine Perspective-Three-Point Problem
Refaçade: Editing Object with Given Reference Texture
Q-MambaIR: Accurate Quantized Mamba for Efficient Image Restoration
Fine-grained Image Aesthetic Assessment: Learning Discriminative Scores from Relative Ranks
ThinkGen: Generalized Thinking for Visual Generation
Mitigating The Distribution Shift of Diffusion-based Dataset Distillation
PHATE-Net: Differentiable Pseudotime Learning for Trustworthy Disease Trajectories in PET
MoEActok: A MoE-based Action Tokenizer for Vision-Language-Action Models
PoM: A Linear-Time Replacement for Attention with the Polynomial Mixer
ORCA: Orchestrated Reasoning with Collaborative Agents for Document Visual Question Answering
Qinling-GFFE: A Novel Station-based Benchmark and Graph-Frequency Fusion Enhancer for Precipitation Forecasting
Improving Calibration in Test-Time Prompt Tuning for Vision-Language Models via Data-Free Flatness-Aware Prompt Pretraining
Deep Feedback ConvNets by Embedding the Working Memory Module for Image Classification
Multi-SpatialMLLM: Multi-Frame Spatial Understanding with Multi-Modal Large Language Models
Channel Correlation Loss for Binary Neural Networks
Cross-Domain Demo-to-Code via Neurosymbolic Counterfactual Reasoning
Aligning What Vision-Language Models See and Perceive with Adaptive Information Flow
MegAD: An Expert in Meta-Learning Guided Few-Shot Anomaly Detection
Long-Tail Internet Photo Reconstruction
Write Where It Matters: Policy-Guided Watermarks for 3D Gaussian Splatting
SGST-Transformer: A Spherical Geometry-Aware Spatio-Temporal Transformer for 360° Video Saliency Prediction
From Drops to Grid: Noise-Aware Spatio-Temporal Neural Process for Rainfall Estimation
MoCha: End-to-End Video Character Replacement without Structural Guidance
Faster-GS: Analyzing and Improving Gaussian Splatting Optimization
X-band Radar Non-Line-of-Sight Imaging
AdaPerceiver: Transformers with Adaptive Width, Depth, and Tokens
Revisiting 2D Foundation Models for Scalable 3D Medical Image Classification
OMG-Avatar: One-shot Multi-LOD Gaussian Head Avatar
Texture-Guided Multiscale Cross-Modal Fusion for AI-Generated Image Quality Assessment
Anatomica: Localized Control over Geometric and Topological Properties for Anatomical Diffusion Models
Res2SPDNet: Multi-Granularity SPD Matrix Residual Learning for Signal Classification
Envisioning the Future, One Step at a Time
Imbalanced View Contribution Evaluation and Refinement for Deep Incomplete Multi-View Clustering
From Navigation to Refinement: Revealing the Two-Stage Nature of Flow-based Diffusion Models through Oracle Velocity
Breaking Spurious Correlations: Uncertainty-Driven Causal Transformers for AU Detection
MambaEye: A Size-Agnostic Visual Encoder with Causal Sequential Processing
MPL: Match-guided Prototype Learning for Few-shot Action Recognition
VerseCrafter: Dynamic Realistic Video World Model with 4D Geometric Control
Spectral-Aware Adaptive Convolution for Fine-Grained Cross-View Visual Localization
Agile Deliberation: Concept Deliberation for Subjective Visual Classification
Context-Aware Semantic Segmentation via Stage-Wise Attention
WISER: Wider Search, Deeper Thinking, and Adaptive Fusion for Training-Free Zero-Shot Composed Image Retrieval
MFI-ResNet: Efficient ResNet Architecture Optimization via MeanFlow Compression and Selective Incubation
Diffusion Probe: Generated Image Result Prediction Using CNN Probes
AlphaMerging: Orthogonal Subspace Projection of Task Vectors to Reduce Task Interference for Multi-Task Model Merging
Mining Attribute Subspaces for Efficient Fine-tuning of 3D Foundation Models
Rethinking Compact (<1M) Vision Models: Balancing Accuracy and Speed through Multi-Path Atrous Convolutions
Hi3Doc: Hierarchical Tri-Level Representations for Multimodal Long-Document Understanding
Electromagnetic Inverse Scattering from a Single Transmitter
Physically-Grounded Turbulence Mitigation with Frame-Shared Degradation Parameters
LongDocSpan: Extending LVLMs for Long Document Understanding
Point4Cast: Streaming Dynamic Scene Reconstruction and Forecasting
DiG: Differential Grounding for Enhancing Fine-Grained Perception in Multimodal Large Language Models
From Failure to Feedback: Group Revision Unlocks Hard Cases in Object-Level Grounding
M-DocSum: Do LVLMs Genuinely Comprehend Interleaved Image-Text in Document Summarization?
APEX: A Decoupled Memory-based Explorer for Asynchronous Aerial Object Goal Navigation
InstructTable: Improving Table Structure Recognition Through Instruction
EE-RL: Vision Language Guided Reinforcement Learning with Explorer and Expert model for End-to-End Autonomous Driving
D^2-FOSA: Dual-Diffusion Guided EEG-to-Image Reconstruction with Frequency-Oriented Semantic Alignment
SciPostLayoutTree: A Dataset for Structural Analysis of Scientific Posters
Understanding Counting Mechanisms in Large Language and Vision-Language Models
Efficient Document Parsing via Parallel Token Prediction
ChartAgent: A Chart Understanding Framework with Tool Integrated Reasoning
Free-Lunch Long Video Generation via Layer-Adaptive O.O.D Correction
Boosting Reasoning in Large Multimodal Models via Activation Replay
RedVTP: Training-Free Acceleration of Diffusion Vision-Language Models Inference via Masked Token-Guided Visual Token Pruning
Black-box Membership Inference Attacks on the Pre-training Data of Image-generation Models
FREE-Switch: Frequency-Based Dynamic LoRA Switch for Style Transfer
Enhancing Part-Level Point Grounding for Any Open-Source MLLMs
Detecting AI-Generated Forgeries via Iterative Manifold Deviation Amplification
FedVG: Gradient-Guided Aggregation for Enhanced Federated Learning
What and Where to Adapt: Structure–Semantics Co-Tuning for Machine Vision Compression via Synergistic Adapters
A Closed-Form Solution for Debiasing Vision-Language Models with Utility Guarantees Across Modalities and Tasks
ViTPrompt: Training-Free Prompt Refinement with Visual Tokens for Open-Vocabulary Detection
A Comprehensive Study on Visual Token Redundancy for Discrete Diffusion-based Multimodal Large Language Models
Stability-Driven Motion Generation for Object-Guided Human-Human Co-Manipulation
Beyond Heuristic Prompting: A Concept-Guided Bayesian Framework for Zero-Shot Image Recognition
GM-Skip: Metric-Guided Transformer Block Skipping for Efficient Vision-Language Models
Diagnosing and Repairing Unsafe Channels in Vision-Language Models via Causal Discovery and Dual-Modal Safety Subspace Projection
Dyna-ViT: Parameter-Free Pre-Encoder Token Pruning for Efficient Vision Transformers
SDDF: Specificity-Driven Dynamic Focusing for Open-Vocabulary Camouflaged Object Detection
GardenDesigner: Encoding Aesthetic Principles into Jiangnan Garden Construction via a Chain of Agents
Efficient Long-Context Modeling in Diffusion Language Models via Block Approximate Sparse Attention
MR. Illuminate: Zero-Shot Low-Light Image Enhancement with Diffusion Prior
CADC: Content Adaptive Diffusion-Based Generative Image Compression
MaMe: Matrix-Based Token Merging
Tiny Inference-Time Scaling with Latent Verifiers
TEXTRIX: Latent Attribute Grid for Native Texture Generation and Beyond
DaMN: Deleting and Migrating Normalization Layers from Transformers
Mitigating Objectness Bias and Region-to-Text Misalignment for Open-Vocabulary Panoptic Segmentation
Relational Visual Similarity
Enriching Knowledge Distillation with Cross-Modal Teacher Fusion
Evo-1: Lightweight Vision-Language-Action Model with Preserved Semantic Alignment
UNIFORM: Unifying Knowledge from Large-scale and Diverse Pre-trained Models
Variational Graph-based Normal Integration
ProGIC: Progressive and Lightweight Generative Image Compression with Residual Vector Quantization
BoostSLT: Boosting Sign Language Translation via a Plug-and-Play Diffusion-Based Semantic Enhancer
MipKV: A Sparsify-then-Recover Paradigm for Accelerating Large Vision-Language Model Pre-Filling
OpenVO: Open-World Visual Odometry with Temporal Dynamics Awareness
Mix-to-Max: Optimizing Data Mixtures for Peak Vision-Language Efficiency
Immunizing Models Against Harmful Long-Horizon Fine-Tuning via Contractive Optimization Dynamics
INTERLACE: Interleaved Layer Pruning and Efficient Adaptation in Large Vision-Language Models
MimicTalker: A Multimodal Interactive and Memory-Enhanced Framework for Real-Time Dyadic 3D Head Generation
JetViT: Efficient High-Resolution Vision Transformer with Post-Training Attention Search
From Rays to Projections: Better Inputs for Feed-Forward View Synthesis
NexusFlow: Unifying Disparate Tasks under Partial Supervision via Invertible Flow Networks
SLAD: Shared LoRA Adapters for Task Specific Distillation
MDS-VQA: Model-Informed Data Selection for Video Quality Assessment
PP-Brep: Few-Shot B-rep Classification with Hybrid Graph Representation
D4C: Data-Free Quantization for Contrastive Language-Image Pre-Training Models
Multimodal Semantic Bias Mitigation for Diverse Text-To-3D Generation
ELSA: Exact Linear-Scan Attention for Fast and Memory-Light Vision Transformers
VibeToken: Scaling 1D Image Tokenizers and Autoregressive Models for Dynamic Resolution Generations
LNEM: Lunar Neural Elevation Model
MPM: Mutual Pair Merging for Efficient Vision Transformers
When Understanding Becomes a Risk: Authenticity and Safety Risks in the Emerging Image Generation Paradigm
Generalized Neighborhood Attention: Multi-dimensional Sparse Attention at the Speed of Light
AlignFL: Adaptive Learning and Intelligent Generation of Networks for Federated Learning
TopoMA: Topology-Guided Multi-Agent Dense RGB 3D Reconstruction via Distributed Inference
A Polarized Reflection and Material Dataset of Real World Objects
Rejection Mixing: Fast Semantic Propagation of Mask Tokens for Efficient DLLM Inference
Beyond Loss Values: Robust Dynamic Pruning via Loss Trajectory Alignment
Towards Photorealistic and Efficient Bokeh Rendering via Diffusion Framework
Cross-Domain Few-Shot Segmentation via Multi-view Progressive Adaptation
Positive Divide and Negative Discrepancy: A New Perspective on Multi-Label Logit Distillation
No Hard Negatives Required: Concept Centric Learning Leads to Compositionality without Degrading Zero-shot Capabilities of Contrastive Models
Hilbert-Geo: Solving Solid Geometric Problems by Neural-Symbolic Reasoning
V2U4Real: A Real-world Large-scale Dataset for Vehicle-to-UAV Cooperative Perception
Beyond Accuracy: An Empirical Study of Perception Stability in Multimodal Large Language Models
PosterReward: Unlocking Accurate Evaluation for High-Quality Graphic Design Generation
Event-VStream: Event-Driven Real-Time Understanding for Long Video Streams
OpenT2M: No-frill Motion Generation with Open-source, Large-scale, High-quality Data
STAvatar: Soft Binding and Temporal Density Control for Monocular 3D Head Avatars Reconstruction
M^3A Policy: Mutable Material Manipulation Augmentation Policy through Photometric Re-rendering
PiLoT: Neural Pixel-to-3D Registration for UAV-based Ego and Target Geo-localization
Cut to the Chase: Training-free Multimodal Summarization via Chain-of-Events
AOMGen: Photoreal, Physics-Consistent Demonstration Generation for Articulated Object Manipulation
When Pretty Isn’t Useful: Investigating Why Modern Text-to-Image Models Fail as Reliable Training Data Generators
Environmental Understanding Vision-language Model for Embodied Agent
DEGround: An Effective Baseline for Ego-centric 3D Visual Grounding With a Homogeneous Framework
Clinically-Grounded Counterfactual Reasoning for Medical Video Diagnosis
R4: Retrieval-Augmented Reasoning for Vision-Language Models in 4D Spatio-Temporal Space
Prune-Then-Plan: Step-Level Calibration for Stable Frontier Exploration in Embodied Question Answering
Photo3D: Advancing Photorealistic 3D Generation through Structure‑Aligned Detail Enhancement
Think Twice, Act Once: Verifier-Guided Action Selection For Embodied Agents
Rosetta Stone For Unified MLLMs: A Unified Tokenizer to Decipher Understanding and Generation
Plug-and-Think: Structured Reasoning for Vision–Language–Action Models
Towards Stable Self-Supervised Object Representations in Unconstrained Egocentric Video
World Model Robustness via Surprise Recognition
T2SGrid: Temporal-to-Spatial Gridification for Video Temporal Grounding
OVI-MAP: Open-Vocabulary Instance-Semantic Mapping
Less is More: Data-Efficient Adaptation for Controllable Text-to-Video Generation
PlanGS: Active 3D Gaussian Reconstruction with Real-Time Planning
Drainage: A Unifying Framework for Addressing Class Uncertainty
A Simple Framework for Visual Navigation
Lifelong Imitation Learning with Multimodal Latent Replay and Incremental Adjustment
Concept Regions Matter: Benchmarking CLIP with a New Cluster-Importance Approach
Event-Based Optical Flow Leveraging Precise Event Timing
Fine-Grained Multi Image Object Hallucination Benchmark
MixerCSeg: An Efficient Mixer Architecture for Crack Segmentation via Decoupled Mamba Attention
Generative Event Pretraining with Foundation Model Alignment
AVGGT: Rethinking Global Attention for Accelerating VGGT
OrionEdit: Bridging Reference and Source Images for Generalized Cross-Image Editing
HelixTrack: Event‑Based Tracking and RPM Estimation of Propeller-like Objects
Bridging Pixels and Words: Mask-Aware Local Semantic Fusion for Multimodal Media Verification
PEPR: Privileged Event-based Predictive Regularization for Domain Generalization
TimeBridge: Self-Supervised Video Representation Learning via Start-End Joint Embedding and In-Between Frame Prediction
MindDriver: Introducing Progressive Multimodal Reasoning for Autonomous Driving
Unleashing the Potential of Event-Based Stereo Via Coarse-to-Fine Bio-Inspired Regression
Trust-calibrated Collaborative Learning for Long-Tailed Visual Recognition
Metric-Guided Feature Fusion of Visual Foundation Models for Segmentation Tasks
Rethinking Two-Stage Referring-by-Tracking in Referring Multi-Object Tracking: Make it Strong Again
An Interpretable Alzheimer's Disease Diagnosis Model via Gray Matter Attention Guided Counterfactual Reasoning
SafeDrive: Fine-Grained Safety Reasoning for End-to-End Driving in a Sparse World
StreamDiT: Real-Time Streaming Text-to-Video Generation
IR-HGP: Physically-Aware Gaussian Inverse Rendering for High-Illumination Scenes via Generative Priors
Beyond Semantics: Disentangling Information Scope in Sparse Autoencoders for CLIP
CryoKRAQEN: Kernel-Regularized Annealing for Quantized Embedding Networks in Cryo-EM Heterogeneous Reconstruction
CLIP-Free, Label Free, Unsupervised Concept Bottleneck Models
SliderEdit: Continuous Image Editing with Fine-Grained Instruction Control
Reinforcement-Guided Synthetic Data Generation for Privacy-Sensitive Identity Recognition
Faithful Contouring: Near-Lossless 3D Voxel Representation Free from Iso-surface
Beyond Top-1: Forensic Analysis of Full Prediction Distributions Reveals Hidden Model Reasoning
Task-Oriented Data Synthesis and Control-Rectify Sampling for Remote Sensing Semantic Segmentation
Zero-Shot Textual Explanations via Translating Decision-Critical Features
Robust Spiking Neural Networks by Temporal Mutual Information
Correspondence-Attention Alignment for Multi-View Diffusion Models
GeoDexGrasp: Geometry-aware Generation for Data-efficient and Physics-plausible Dexterous Grasping
DMin: Scalable Training Data Influence Estimation for Diffusion Models
A Framework for Evaluating Zero-Shot Image Generation in Concept-Based Explainability
DiP: Taming Diffusion Models in Pixel Space
Self-Guided Integrated Gradient Method for Attribution
ReLaX: Reasoning with Latent Exploration for Large Reasoning Models
Circuit Tracing in Vision-Language Models: Understanding the Internal Mechanisms of Multimodal Thinking
DiffBMP: Differentiable Rendering with Bitmap Primitives
Discovering Attention Head Interactions in Vision Transformers
SFR-Net: Steering-Fusion-Refining Network in Multi-label Zero-Shot Sewer Defect Detection
SAM 3D: 3Dfy Anything in Images
CogniEdit: Dense Gradient Flow Optimization for Fine-Grained Image Editing
Value bounds and Convergence Analysis for Averages of LRP attributions
STAR: Test-Time Adaptation Can Enhance Universal Prompt Learning for Vision-Language Models
Anti-Degradation Lifelong Multi-View Clustering
MReactor: Offline Multiple Appropriate Facial Reaction Generation with Hierarchical Cognitive Disentanglement
B-MoE: A Body-Part-Aware Mixture-of-Experts “All Parts Matter” Approach to Micro-Action Recognition
Learning by Neighbor-Aware Semantics, Deciding by Open-Form Flows: Towards Robust Zero-Shot Skeleton Action Recognition
Actionable Human Motion Generation via Latent Imitation and Fine-Grained Text Completion
GHOST: Fast Category-Agnostic Hand-Object Interaction Reconstruction from RGB Videos Using Gaussian Splatting
Hoi3DGen: Generating High-Quality Human-Object-Interactions in 3D
CoherentHand: Temporally Consistent 3D Hand Trajectory Synthesis with Semantic Motion Priors
Weakly Supervised Micro-Expression Spotting based on Boundary Refinement Mechanism and Cross-subject Learning Representation
FUSION: Full-body Unified Motion Prior for Body and Hands Via Diffusion
BridgeDiffusion: Latent Space Optimization for Independent Body-Part Generation with Motion Consistency Bridges in Interactive Dance
MARIO: Motion-Augmented Real-Time Multi-Sensor Inertial Odometry
BitTP: The Lightweight Trajectory Prediction Model with BitLLM for Edge-Devices
WHOLE: World-Grounded Hand-Object Lifted from Egocentric Videos
TalkVid: A Large-Scale Diversified Dataset for Audio-Driven Talking Head Synthesis
How2Sign-Synth3D: Markerless Holistic Sign Language Performance Capture and Synthetic Data for Dense Landmark Tracking
SocialMirror: Reconstructing 3D Human Interaction Behaviors from Monocular Videos with Semantic and Geometric Guidance
EggHand: A Multimodal Foundation Model for Egocentric Hand Pose Forecasting
Towards Metric-Aware Multi-Person Mesh Recovery by Jointly Optimizing Human Crowd in Camera Space
VoxFace: Streaming Audio-Visual Synthesis via Relay-Style Multi-Token Prediction for Interactive Conversation
OmniHead: A Unified Model for Dynamic Nonverbal Facial Behaviors
Detecting Precise Hand Touch Moments in Egocentric Video
Less is More: Multimodal Human Pose Estimation with Selective Fusion
PHYLOMAN: Generative Behavior Control via Fusing LLM Planning and Physics-based Control
Contact Matrix: Enhancing Dance Motion Synthesis with Precise Interaction Modeling
Learning Predictive Visuomotor Coordination
FSMC-Pose: Frequency and Spatial Fusion with Multiscale Self-Calibration for Cattle Mounting Pose Estimation
Bootstrapping Sign Language Annotations with Sign Language Models
OmniMotion-X: Versatile Multimodal Whole-Body Motion Generation
THOM: Generating Physically Plausible Hand-Object Meshes From Text
ExposeAnyone: Personalized Audio-to-Expression Diffusion Models Are Robust Zero-Shot Face Forgery Detectors
All-Age Human Mesh Recovery
GeneFlow: Modeling Heredity and Variation via Flow Matching Transformers for Kinship Verification
Dynamic Full-body Motion Agent with Object Interaction via Blending Pre-trained Modular Controllers
MotionDuet: Dual-Conditioned 3D Human Motion Generation with Video-Regularized Text Learning
Fast-HOI: Fast Human-Object Interaction Synthesis via Distilled Interaction Prior and Physical Constrains
HM-Talker: Hybrid Motion Modeling for High-Fidelity Talking Head Synthesis
GeoHOI: Geometry-Enhanced Human-Object Interaction Video Generation via Hierarchical Multi-Modal Injection
TAUE: Training-free Noise Transplant and Cultivation Diffusion Model
GR-Diffusion: Graph-Guided Relational-Aware Diffusion via Attention Alignment
V-GRPO: Online Reinforcement Learning for Denoising Generative Models Is Easier than You Think
FREESTYLE: An Anchor-Free Mechanism for Training-Free Style-Aligned Image Generation
Is Your Text-to-Image Model Robust to Caption Noise?
Decomposing Subject-Driven Image Generation via Intermediate Structural Prediction
RectifiedHR: Enable Efficient High-Resolution Synthesis via Energy Rectification
ASTRA: Enhancing Multi-Subject Generation with Retrieval-Augmented Pose Guidance and Disentangled Position Embedding
Objects in Generated Videos Are Slower Than They Appear: Models Suffer Sub-Earth Gravity and Don’t Know Galileo’s Principle...for now
Group Relative Attention Guidance for Image Editing
ControlPose: High-Fidelity Pose-Controlled Image Generation with Multi-Faceted Pose Disentanglement
FlowC2S: Flowing from Current to Succeeding Frames for Fast and Memory-Efficient Video Continuation
Latent-Compressed Variational Autoencoder for Video Diffusion Models
Deep Parameter Interpolation for Scalar Conditioning
Mining Real-World Image Relations for Large-Scale Controllable Generation and Editing
Disentangle Once, Control All: A Unified and Efficient Framework for Disentangling Multi-Condition Control in Human Video Generation
HAM: A Training-Free Style Transfer Approach via Heterogeneous Attention Modulation for Diffusion Models
Gaussian Shannon: High-Precision Diffusion Model Watermarking Based on Communication
Video4Spatial: Towards Visuospatial Intelligence with Context-Guided Video Generation
Text-Driven Reasoning Video Editing via Reinforcement Learning on Digital Twin Representations
Beyond Optimal Transport: Model-Aligned Coupling for Flow Matching
Stochastic Perturbations Improve Distribution-to-Distribution Generative Models
StereoSpace: Depth-Free Synthesis of Stereo Geometry via End-to-End Diffusion in a Canonical Space
FA-MoE: Improving Medical Image Generation Through Frequency-Aware Mixture of Experts
Generated Reality: Human-Centric World Simulation Using Interactive Video Generation with Hand and Camera Control
VHOI: Controllable Video Generation of Human–Object Interactions from Sparse Trajectories via Motion Densification
LoViC: Efficient Long Video Generation with Context Compression
FedErase: Personalized Federated Unlearning for Text-to-Image Diffusion Models
Zero4D: Training-Free 4D Video Generation From Single Video Using Off-the-Shelf Video Diffusion Models
Earthquake-Bench: Video Generation Benchmark for Earthquake Simulation
OmniInsert: Mask-Free Video Insertion of Any Reference via Diffusion Transformer Models
Block Cascading: Training Free Acceleration of Block-Causal Video Models
Activation-Norm Maximization to Accelerate Training in Flow-Matching Transformers
FREE: Uncertainty-Aware Autoregression for Parallel Diffusion Transformers
No Cache Left Idle: Accelerating diffusion model via Extreme-Slimming Caching
Inference-Time Alignment of Diffusion Models with Evolutionary Algorithms
TokenErase: Robust Concept Erasure via Visual-Injected Token Optimization
VisionCreator: A Native Visual-Generation Agentic Model with Understanding, Thinking, Planning and Creation
Animated-ART: Multi-Layer Transparent Video Generation
Rethinking Conditioning in Diffusion Models: Dynamic Token Scheduling for Efficient and Aligned Text-to-Image Generation
Attention-Guided Energy Optimization for Label-Aligned Anomaly Generation
USV: Unified Sparsification for Accelerating Video Diffusion Models
OminPSD: Layered PSD Generation with Diffusion Transformer
Video2LoRA: Unified Semantic-Controlled Video Generation via Per-Reference-Video LoRA
Depth Adaptive Efficient Visual Autoregressive Modeling
Cross-Resolution Diffusion Models Via Network Pruning
FrameDiT: Diffusion Transformer with Matrix Attention for Efficient Video Generation
Understanding Reward Hacking in Text-to-Image Reinforcement Learning
OminiControl2: Efficient Conditioning for Diffusion Transformers
Seen-to-Scene: Keep the Seen, Generate the Unseen for Video Outpainting
AdaGaR: Adaptive Gabor Representation for Dynamic Scene Reconstruction
MoVieDrive: Urban Scene Synthesis with Multi-Modal Multi-View Video Diffusion Transformer
InstaDA: Augmenting Instance Segmentation Data with Dual-Agent System
One Model for All: Unified Try-On and Try-Off in Any Pose via LLM-Inspired Bidirectional Tweedie Diffusion
Adversarial Concept Distillation for One-Step Diffusion Personalization
DSA: Dynamic Step Allocation for Fast Autoregressive Video Generation
Anomaly Agent: Unified Anomaly Retrieval and Synthesis Before Manufacturing
S^2DiT: Sandwich Diffusion Transformer for Mobile Streaming Video Generation
UniLat3D: Geometry-Appearance Unified Latents for Single-Stage 3D Generation
ColorMam: Color-Aware State Space Model for Image Color Style Transfer
NumeriKontrol: Adding Numeric Control to Diffusion Transformers for Instruction-based Image Editing
Towards Source-Aware Object Swapping with Initial Noise Perturbation
SyntheticManga: Training-Free Manga Generation with Phased Diffusion
Fast Autoregressive Video Generation with Diagonal Decoding
E-GRPO: High Entropy Steps Drive Effective Reinforcement Learning for Flow Models
Bind-Your-Avatar: Multi-Character-Talking Video Generation with Dynamic 3D-mask-based Embedding Router
SCAIL: Towards Studio-Grade Character Animation via In-Context Learning of 3D-Consistent Pose Representations
PEDRA: Evaluating the Realism of Pedestrian Dynamics in Video Generation
VideoCanvas: Unified Video Completion from Arbitrary Spatiotemporal Patches via In-Context Conditioning
Jano: Adaptive Diffusion Generation with Early-Stage Convergence Awareness
Low-Bitrate Video Compression through Semantic-Conditioned Diffusion
Decoupled Scale-wise Autoregressive Modeling for Visual Generation
TalkVerse: Democratizing Minute-Long Audio-Driven Video Generation
Future Optical Flow Prediction Improves Robot Control and Video Generation
Stepper: Stepwise Immersive Scene Generation with Multiview Panoramas
Drive-Cascade: Autoregressive Occupancy to LiDAR and Video Synthesis
ADAPT: Attention Driven Adaptive Prompt Scheduling and InTerpolating Orthogonal Complements for Rare Concepts Generation
Concept Erasure via Attention Redirection
Loom: Diffusion-Transformer for Interleaved Generation
Rethinking Training Dynamics in Scale-Wise Autoregressive Generation
HiStream: Efficient High-Resolution Video Generation via Redundancy Eliminated Streaming
Eevee: Towards Close-up High-resolution Video-based Virtual Try-on
Consistent Video Editing as Flow-Driven Image-to-Video Generation
IM-Animation: An Implicit Motion Representation for Identity-Decoupled Character Animation
UniTalking: A Unified Audio-Video Framework for Talking Portrait Generation
Generative Visual Chain-of-Thought for Image Editing
UniLayDiff: A Unified Diffusion Transformer for Content-Aware Layout Generation
Blend-Aware Latent Diffusion: Mitigating Stitched Seams in Image Inpainting
One Layer Is Enough: Adapting Pretrained Visual Encoders for Image Generation
RealDiffusion: Physics-informed Attention for Multi-character Storybook Generation
SwiftPie: Lightning-fast Subject-driven Image Personalization via One step Diffusion
Video Generation Models are Good Latent Reward Models
Harnessing Layered Graphic Designs with Real Intentions for Text-to-Design Generation
VeCoR — Velocity Contrastive Regularization for Flow Matching
CETCam: Camera-Controllable Video Generation via Consistent and Extensible Tokenization
SafetyBPO: Bidirectional Preference Optimization for Safe Text-to-Image Generation
Fashion130K: An E-commerce Fashion Dataset for Outfit Generation with Unified Multi-modal Condition
PoseGen: In-Context LoRA Finetuning for Pose-Controllable Long Human Video Generation
DebFilter: Eradicating Biases Stashed in Value
PEdit: Pareto-Guided Image Editing via Dynamic Latent Trajectory Control
Prompt-Guided Image Editing with Masked Logit Nudging in Visual Autoregressive Models
Pioneering Perceptual Video Fluency Assessment: A Novel Task with Benchmark Dataset and Baseline
Adapting Large VLMs with Iterative and Manual Instructions for Generative Low-light Enhancement
Beyond Pixel Loss: Video-INRs Prefer Perceptual Optimization
MVSSM: Motion-aware Visual State Space Model for Efficient Video Deblurring
PrismNet: Semantic-Aware Image Enhancement via Vision Transformer and Zero-Cost Gating
FLAIR: Frequency- and Locality-Aware Implicit Neural Representations
CtrlISP: Rescuing Low-Light RAW Images via Controllable Neural ISP
Deepfake-Agent: Aggregating Semantic Forgery Clues for Generalizable Detection
How far have we gone in Generative Image Restoration? A study on its capability, limitations and evaluation practices
PCSTracker: Long-term Scene Flow Estimation for Point Cloud Sequences
POS-ISP: Pipeline Optimization at the Sequence Level for Task-aware ISP
Semantic-Aware Spectral Reconstruction: A Spectral Library-Aided Unsupervised Method Based on the Diffusion Model
Linear Recurrent Unit with Semantic Modulation for Image Super-Resolution
RodNet: Visual Pathway-Inspired Adaptive Sparse Network for Efficient Low-Light Image Enhancement
LWTformer: A Detail-Aware, Learnable Wavelet-Transformer for Ancient Chinese Character Image Restoration
SAT: Selective Aggregation Transformer for Image Super-Resolution
PhyFusion: Physics-Aware Infrared and Visible Image Fusion via Modality-Specific Physical Priors
UnfoldIR: Rethinking Deep Unfolding Network in Illumination Degradation Image Restoration
Evaluating Low-Light Image Enhancement Across Multiple Intensity Levels
FALCON: Fast Adaptive Lightweight Computation of Intensities and Events for Depth Estimation
Learning to Translate Noise for Robust Image Denoising
QDM: Quadtree-Based Region-Adaptive Sparse Diffusion Models for Efficient Image Super-Resolution
AlignVAR: Towards Globally Consistent Visual Autoregression for Image Super-Resolution
Optical Tolerance-Compensated Diffusion Model for Image Restoration
TinySR: Shallow Diffusion Transformers for Real-World Image Super-Resolution
Inf-Dehaze: Beyond GPU Memory Constraints for Ultra-High-Resolution Image Dehazing
DenoiseGS: Gaussian Reconstruction Model for Burst Denoising
FlowSteer: Conditioning Flow Field for Consistent Image Restoration
P^2CS: Parallel Point Cloud Pre-Training with Semantic Consistency
Towards Calibrated Gradient-based Multi-Task Learning
Brain-Inspired Multimodal Spike Neural Network for Image-Text Retrieval
Conformal Cross-Modal Active Learning
Deep-to-Shallow Knowledge Transfer: Multi-Scale Self-Distillation with Bidirectional Aware for 3D Brain Segmentation
MedSAD-CLIP: Supervised CLIP with Token-Patch Cross-Attention for Medical Anomaly Detection and Segmentation
Rethinking Whole-Body CT Image Interpretation: An Abnormality-Centric Approach
Generative Vision-Language Multiple Instance Learning for Weakly Supervised Neonatal Fundus Screening and Reporting
Mitigating Batch Effects in Histopathology via Language-Mediated Robust Embedding Generation
PTF-CT: Polar-Aware Temporal-Frequential Iterative Reconstruction for Sparse-View CT
Learning from Noisy Prompts: Saliency-Guided Prompt Distillation for Robust Segmentation with SAM
Towards Noise-Robust Medical Segmentation via Chebyshev-Attention-Based Asymmetric UNet
Two-Stage 3D Pulmonary Vessel Reconstruction via Trunk–Expansion Coupled Point Cloud Generation
A Simple yet Effective Data Scaling Strategy for Semi-Supervised Medical Image Segmentation
DepthScopy: Decoupling Frequency for Endoscopic Depth Estimation in Sparsely-Textured Regions
ReCliFF: Adaptive Orthogonal Decoupling for Federated Fine-tuning of Medical MLLMs
Volumetrically Consistent Implicit Atlas Learning via Neural Diffeomorphic Flow for Placenta MRI
Vision-Language Models for Automated 3D PET/CT Report Generation
PaM-MIL: Proliferation and Metastasis Enhanced Localization for Multiple Instance Learning on Pathology Images
Surgical Procedural Planning as 3D World Modelling: Towards Automated Pulmonary Resection
From Adaptation to Generalization: Adaptive Visual Prompting for Medical Image Segmentation
AceMIL: Ordinal-Aware Multiple Instance Learning for Pathological Progression Analysis
PhySe-RPO: Physics and Semantics Guided Relative Policy Optimization for Diffusion-Based Surgical Smoke Removal
Anatomy-CoT: Teaching MLLMs to Reason in Radiology
DELRER: Disease Evolution-Informed Longitudinal Radiology Report Generation
M^4Fuse: Lightweight State-Space MoE with a Cross-Scale Gating Bridge for Brain Tumor Segmentation
DynaMind: Reconstructing Dynamic Visual Scenes from EEG by Aligning Temporal Dynamics and Multimodal Semantics to Guided Diffusion
MAE-XNT: A Foundation Model for Segmenting Neuronal Tissue Volumes Generated with X-Ray Nanotomography
NAKUL-Med: Spectral-Graph State Space Models with Dynamics Kernels for Medical Signals
Gaze into the Details: Locality-Sensitive Enhancement for OCTA Retinal Vessel Segmentation
M^3D-BFS: a Multi-Stage Dynamic Fusion Strategy for Sample-Adaptive Multi-Modal Brain Network Analysis
Multimodal Decoupled Dynamic Graph Learning for Brain Disease Diagnosis
TICON: A Slide-Level Tile Contextualizer for Histopathology Representation Learning
TP-Seg: Task-Prototype Framework for Unified Medical Lesion Segmentation
C3-Diff: Super-resolving Spatial Transcriptomics via Cross-modal Cross-content Contrastive Diffusion Modelling
MeMix: Multi-Encoder Mixture Framework for Medical Report Generation
Learning Spatial-Preserving Hierarchical Representations for Digital Pathology
Open-Set Spatial Gene Expression Prediction from Histological Images via Retrieval-Augmented Generation
Personalized Functional Brain Network Modeling with Adaptive Auto-Weighted Learning for Automatic Brain Disorder Diagnosis
Do Vision Models Perceive Illusory Motion in Static Images Like Humans?
Meta-CDMTransNet: Cross-Domain Multi-Scale Transformer Meta-Learning Framework for Few-Shot Breast Histopathological Image Classification
PLCReg: Correlation-Aware Polar-Linear Attention for Guiding Medical Image Registration
A Denoising-Enhanced Multimodal Learning Framework for Robust Nasal Endoscopy Report Generation
When Models Learn to Ask Why: Adaptive Causal Reasoning for Trustworthy Medical Vision–Language Models
PBSBench: A Multi-Level Vision-Language Framework and Benchmark for Hematopathology Whole Slide Image Interpretation
Gated Differential Linear Attention: A Linear-Time Decoder for High-Fidelity Medical Segmentation
Vision-Language Models Encode Clinical Guidelines for Concept-Based Medical Reasoning
PGDM: Physics-Guided Noise-Free Diffusion Model Based on Point Spread Function for Light-Scattering Removal in Unpaired Biomedical Images
Elicit and Enhance: Advancing Multimodal Reasoning in Medical Scenarios
Anatomy-Aware Adaptive Feature Perturbation Framework for Semi-Supervised MRI Segmentation
EI: Early Intervention for Multimodal Imaging Based Disease Recognition
Rethinking Medical High-Modality Learning Under Missingness — A Long-Tailed Distribution Perspective
HazeMatching: Dehazing Light Microscopy Images with Guided Conditional Flow Matching
Learning Priors via Hybrid Visual Autoregressive Modeling for Medical Image to Image Translation
BLEG: LLM Functions as Powerful fMRI Graph-Enhancer for Brain Network Analysis
RelativeFlow: Taming Medical Image Denoising Learning with Noisy Reference
UGLMM: Towards Unified Vision Grounding with Large Multimodal Model
FIRE-CIR: Fine-grained Reasoning for Composed Fashion Image Retrieval
Training-Free Cross-Modal Alignment via Anchor Profiles with Statistical Significance Testing
CREM: Compression-Driven Representation Enhancement for Multimodal Retrieval and Comprehension
LLM Guided Multi Style Typography and Layout Generation via Dynamic Direct Preference Optimization
FusionBridge: An Efficient Fusion Via Feature Disentanglement for Multi-Modal Object Re-Identification
LlamaRG: A Multi-View Large Language Model for Radiology Report Generation
Mitigating Information Forgetting via Entropy-Driven Progressive Retrospection for Multimodal Long Reasoning
InternVL-X: Advancing and Accelerating InternVL Series with Efficient Visual Token Compression
R²MoE: Representation and Expert Selection Dual-Regularized Mixture-of-Experts for Multimodal Clinical Data
DUALVISION: RGB–Infrared Multimodal Large Language Models for Robust Visual Reasoning
Parallel In-context Learning for Large Vision Language Models
Vision-R1: Evolving Human-Free Alignment in Large Vision-Language Models via Vision-Guided Reinforcement Learning
Prototype and Sample Level Semantic Alignment for Incomplete Multi-View Clustering
Rethinking VLMs for Image Forgery Detection and Localization
DiFlowDubber: Discrete Flow Matching for Automated Video Dubbing via Cross-Modal Alignment and Synchronization
OTPrune: Distribution-Aligned Visual Token Pruning Via Optimal Transport
Causal Chain-Guided Reasoning for Modular and Explainable Causal-Why Video Question Answering
Materialistic RIR: Material Conditioned Realistic RIR Generation
From Coarse to Precise: Rethinking and Bridging Localization in Multimodal Large Language Models
Do Audio-Visual Large Language Models Really See and Hear?
DraCo: Draft as CoT for Text-to-Image Preview and Rare Concept Generation
FastMMoE: Accelerating Multimodal Large Language Models through Dynamic Expert Activation and Routing-Aware Token Pruning
Anticipatory Planning for Multimodal AI Agents
Quantifying the Gap between Understanding and Generation within Unified Multimodal Models
VideoScaffold: Elastic-Scale Visual Hierarchies for Streaming Video Understanding in MLLMs
Concise Geometric Description as a Bridge: Unleashing the Potential of LLM for Plane Geometric Problem Solving
HippoMM: Hippocampal-inspired Multimodal Memory for Long Audiovisual Event Understanding
A Diagnostic Study of Region-Based Representations in Multimodal LLMs
HoliSafe: Holistic Safety Benchmarking and Modeling for Vision-Language Model
UMI-HOI: Unifying Multimodal Information with Semantic Multi-Head Attention for Human–Object Interaction Detection
AnatomiX, an Anatomy-Aware Grounded Multimodal Large Language Model for Chest X-Ray Interpretation
Visual2Echo Compositional Contrastive Learning (V2E-CCL): Binaural Knowledge Distilled Network for Depth Prediction
TextBind: Your Vision-Language Models are Naturally Unified Multimodal Models
Learning to Walk the Right Paths: Task-Responsive Graph Reasoning for Multimodal Inference
CLASH: A Benchmark for Cross-Modal Contradiction Detection
DA-CLIP: Mitigating Granularity Mismatch in Zero-Shot Anomaly Detection via Decoupled Text-Visual Alignment
HAIT: Hybrid Adversarial Iterative Training for Mitigating Object Hallucination in Large Vision–Language Models
Cross-Modal-Domain Generalization Through Semantically Aligned Discrete Representations
CP-IMoE: Collaborative Prompt-Guided Interactive Mixture-of-Experts for Incomplete Multimodal Learning
Vision Inference Former: Sustaining Visual Consistency in Multimodal Large Language Models
PaLMR: Towards Faithful Visual Reasoning via Multimodal Process Alignment
If you can describe it, they can see it: Cross-Modal Learning of Visual Concepts from Textual Descriptions
LiteEmbed: Adapting CLIP to Rare Classes
CADReasoner: Iterative Program Editing for CAD Reverse Engineering
COSTA: Collaborative Open-Set Test-Time Adaptation Through Robust Prototype Learning
Perturb and Recover: Fine-Tuning for Effective Backdoor Removal from CLIP
PrismPrune: Decoupling Saliency and Diversity in Attention for Efficient Visual Token Pruning in VLMs
Scaling Pre-training to One Hundred Billion Data for Vision Language Models
HAFM: A Post-Fusion Gating Module for Haze-Aware RGB–Thermal Object Detection
CaptAin: Caption-driven Alignment for Bridging Modality Gaps in Partially Relevant Video Retrieval
Dual Anchors, Do It Better: Hierarchical Group Merging for Zero-Shot Anomaly Detection
HeartcareGPT: A Unified Multimodal ECG Suite for Dual Signal–Image Modeling and Understanding
Unbiased Dynamic Multimodal Fusion
Video Reasoning Without Training
Efficient Discrete Diffusion Model for Scalable Multi-Objective Traveling Salesman Problem
EpiMask: Leveraging Epipolar Distance Based Masks in Cross-Attention for Satellite Image Matching
S³O: Selective Spatial-Spectral Operator for Cross-Scale Fusion
Fast Kernel-Space Diffusion for Remote Sensing Pansharpening
Unified Urban Tuning: Co-Enhancing Satellite and Street View Reasoning with a Progressive Tuning Framework
GReD-RSITR: A Generative Re-Examined Discriminative Framework for Remote Sensing Image-Text Retrieval
ZODS-RS — Zero-Training Oriented Detection & Segmentation for Remote Sensing
Turning Generators into Retrievers: Unlocking MLLMs for Natural Language-Guided Geo-Localization
Optimal-Transport-based Feature Alignment for Multimodal Change Detection
HarmoniDiff-RS: Training-Free Diffusion Harmonization for Satellite Image Composition
CrossWeaver: Towards Efficient Cross-Modal Interweaving and Decoupling for Weakly-Aligned Multispectral Object Detection
ProSM: Progressive Soft Masking for Fine-Grained Remote Image Segmentation
UniD-Shift: Towards Unified Semantic Segmentation via Interpretable Shared–Private Multimodal Decomposition
OffNadirLoc: Benchmark and Framework for Challenging UAV-to-Satellite Geo-Localization under Large Off-Nadir Views
M-PhyGs: Multi-Material Object Dynamics from Video
Diffusion^2: Turning 3D Environments into Radio Frequency Heatmaps
Controllable Radar Simulation with Waveform Parameter Embedding
mmDiff: A Noise-Robust Differentiable Ray-Tracing Framework for mmWave Scene Calibration and Channel Prediction
GLOW: Global Illumination-Aware Inverse Rendering of Indoor Scenes Captured with Dynamic Co-Located Light & Camera
Scene-Level Heterogeneous Physics Simulation with 3D Gaussian Splats
How to Achieve Prototypical Birth and Death for OOD Detection?
Uncertainty-Aware Cross-Modal Opinion Interaction: A General Framework for Visible-Infrared Vehicle and Person Re-Identification
EIRES: Training-free AI-Generated Image Detection via Edit-Induced Reconstruction Error Shift
Vote-in-Context: VLMs as Explainable Zero-Shot Rank Fusers
PRADA: Probability-Ratio-Based Attribution and Detection of Autoregressive-Generated Images
HypHOI: Exploring Hierarchical Hyperbolic Embeddings for Human-Object Interaction Detection
A Low-Rank Learning Framework Integrating Detection, Masking, and Recovery for Occluded Facial Expression Recognition
DSAA: Dual-Stage Attribute Activation for Fine-Grained Open Vocabulary Detection
ConSel: Concept-Aware Self-supervised Learning for Regression Beyond Ordinal Tasks
Rolling and Denoising: Rethinking Dynamic Modal Fusion for Multi-Modal Object Re-Identification
Adapting with an Open Mind: Leveraging Open-Vocabulary Detectors for Closed Set Source-Free Domain Adaptive Object Detection
SFS-DETR: Spatial-Frequency Selection for UAV Object Detection
ForenDeX: Unlocking Forensic Insights for Explainable AI-Generated Image Detection
Long-Tailed Out-of-Distribution Detection with Refined Separate Class Learning
Bridging Day and Night: Unsupervised Cross-Domain Re-Identification with Synergistic Prompt and Prototype Learning
Advancing Open-Set Detection and Segmentation via Disentangled Representations
Disrupting Positional Encoding for Effective Open Set Recognition
ODOV: Benchmark the Open-Domain Open-Vocabulary Object Detection
Leave No Stone Unturned: Uncovering Holistic Audio-Visual Intrinsic Coherence for Deepfake Detection
Region-Aware Hierarchical Sub-Feature Alignment for Robust EEG-Based Visual Decoding
Super Sparse DETR: YOLO-Competitive Convergence and Acceleration
Bi-Level Optimization for Single Domain Generalization
SA-Matching DETR: A Lightweight Transformer Detector with Enhanced Scale Adaptive Matching
Asymmetric Collaborative Distillation for Asymmetric Image Retrieval
OKGraph: Online Knowledge Graph Probing for Open-vocabulary Recognition
Large Multimodal Models as General In-Context Classifiers
Indexing Multimodal Language Models for Large-scale Image Retrieval
EvoPrompt-ReID: A Bilevel Optimization Framework for Prompt-Encoder Co-evolution in Image Re-Identification
Leveraging Arbitrary Data Sources for AI-Generated Image Detection Without Sacrificing Generalization
OmniGCD: Abstracting Generalized Category Discovery for Modality Agnosticism
PTAD: Pose and Texture Agnostic Anomaly Detection
Mitigating the ID–OOD Tradeoff in Open-Set Test-Time Adaptation
Towards Universal Open-Set Visual Font Recognition Via Augmented Synthetic Similarity
VR-CLIP: Visual Refinement of CLIP for Zero-Shot Semantic Segmentation
Exploring Hierarchical Consistency and Unbiased Objectness for Open-Vocabulary Object Detection
Once for All: An End-to-End Paradigm for VLM-Based Domain-Generalized Object Detection
SoREL: Soft-Label Refurbishment with Ensemble Learning for Noisy Long-Tailed Classification
Unsupervised Graph Partitioning Framework for Background Suppression in Multi-Query Vehicle Re-Identification
Revisiting Real-Time Detection Transformer with Efficient Encoder Design
PASR: Pose-Aware 3D Shape Retrieval from Occluded Single Views
Ninja Codes: Neurally Generated Fiducial Markers for Stealthy 6-DoF Tracking
DetRefiner: Model-Agnostic Detection Refinement with Feature Fusion Transformer
SpHOR: A Representation Learning Perspective on Open-set Recognition for Identifying Unknown Classes in Deep Neural Networks
Complexity of Linear Regions in Self-supervised Deep ReLU Networks
Decoupled Sub-Feature Uncertainty Modeling for Robust Multimodal Representation Learning
Pre-trained Models Can Count (Almost): Exploring Quantitative Structure in Visual Representations
A-SelecT: Automatic Timestep Selection for Diffusion Transformer Representation Learning
HyperFM: An Efficient Hyperspectral Foundation Model with Spectral Grouping
Seeing Through Fog: Towards Fog-Invariant Action Recognition
Reversing the Flow: Generation-to-Understanding Synergy in Large Multimodal Models
FedAR: Attribute-Guided Representation Learning for Heterogeneous Federated Learning
ZeroDiff++: Balancing Semantic Diffusion Dynamics for Robust Zero-Shot Learning
Equivariant Unsupervised Object Detection with Learnable Riesz Transform and Composite Spatial Transformers
MART: Mechanism-disentanglement Anchor-Routed Training for Learning with Open-World Noisy Data
Online Interpretable Matrix Decomposition for Large-Scale Streaming Data
Object-Centric Vision Token Pruning for Vision Language Models
BrainStack: Neuro-MoE with Functionally Guided Expert Routing for EEG-Based Language Decoding
BiomedHELIX: HiErarchical-Local Interaction eXploration for Biomedical Vision-Language Models
From Fewer Samples to Fewer Bits: Reframing Dataset Distillation as Joint Optimization of Precision and Compactness
Seeing Helps Reasoning in Language Models
Layer Embedding Deep Fusion Graph Neural Network
From Horizontal to Rotated: Cross-View Object Geo-Localization with Orientation Awareness
LinkedOut: Linking World Knowledge Representation Out of Video LLM for Next-Generation Video Recommendation
Learning to Reason: Targeted Knowledge Discovery and Fuzzy Logic Update for Robust Image Recognition
GaussFiller: Unleashing VLM-Expert Guidance for 3D Scene Completion with 3D Gaussian Splatting
GEODE: Geometry-Guided Discrete Diffusion for Open-Vocabulary 3D Scene Graph Generation
Map2Thought: Explicit 3D Spatial Reasoning via Metric Cognitive Maps
SCP: Spatial Causal Prediction in Video
Image-based Outlier Synthesis With Training Data
SpatialDreamer: Incentivizing Spatial Reasoning via Active Mental Imagery
Entropy-Constrained Information Optimal Transport for Multi-View Geo-Localization
Revisiting Image Manipulation Localization under Realistic Manipulation Scenarios
Learning to Wander: Improving the Global Image Geolocation Ability of LMMs via Actionable Reasoning
CADRNet: Cognitively-Inspired Active Vision for 3D Reasoning Segmentation via Differentiable Rendering
Direct Language Embedding Enables Gaussian Splatting for Large Scenes
CogNet: Multi-Agent Collaborative Reasoning and Verification for Salient Object Ranking
MPerS: Dynamic MLLM MixExperts Perception-Guided Remote Sensing Scene Segmentation
Towards Generalization of Scene Text Tampering Localization via Causal Invariance
Background-Compensated Audio-Visual Semantic Modulation Framework for Audio-Visual Event Localization
POMA-3D: The Point Map Way to 3D Scene Understanding
Gazemo: Mimicking Human Saccades via Foveal-Peripheral Feature Modeling for Lightweight Semantic Segmentation
MoonSeg3R: Monocular Online Zero-Shot Segment Anything in 3D with Reconstructive Foundation Priors
AuralSAM2: Enabling SAM2 Hear Through Pyramid Audio-Visual Feature Prompting
PrAda: Few-Shot Visual Adaptation for Text-Prompted Segmentation
SAGE: Shape-Adapting Gated Experts for Adaptive Histopathology Image Segmentation
Prompt-driven Small Object Instance Segmentation in Earth Observation
OV-Stitcher: A Global Context-Aware Framework for Training-Free Open Vocabulary Semantic Segmentation
SCOPE: Scene-Contextualized Incremental Few-Shot 3D Segmentation
Towards Complete Activation: Foreground-Background Multi-Perspective Guided Cross-Support for Few-Shot Segmentation
MHMamba: Multi-Head Mamba for 3D Brain Tumor Segmentation
ROSE: Retrieval-Oriented Segmentation Enhancement
ConInfer: Context-Aware Inference for Training-Free Open-Vocabulary Remote Sensing Segmentation
Unify the Views: View-Consistent Prototype Learning for Few-Shot Segmentation
Autoregressive Universal Video Segmentation Model
FCL-COD: Weakly Supervised Camouflaged Object Detection with Frequency-aware and Contrastive Learning
Counterfactual Segmentation Reasoning: Diagnosing and Mitigating Pixel-Grounding Hallucination
Weakly-Supervised Referring Video Object Segmentation Through Text Supervision
TALENT: Target-Aware Efficient Tuning for Referring Image Segmentation
DeepDP-TGMM: Amortized Non-Parametric Clustering for Hyperspherical Self-Supervised Representations
Proto-SaGa: Prototype-based 3D Scene Segmentation with Semantic-aware Gaussian Grouping
RecycleLoRA: Rank-Revealing QR-Based Dual-LoRA Subspace Adaptation for Domain Generalized Semantic Segmentation
Instruction-Focus-Prompt: Semantics-Driven Structural Prompts for Universal SAM Segmentation
Continual Alignment for SAM: Rethinking Foundation Models for Medical Image Segmentation in Continual Learning
VirPro: Visual-Referred Probabilistic Prompt Learning for Weakly-Supervised Monocular 3D Detection
A Single Pixel is All You Need: Weakly Supervised Medical Image Segmentation using Discrete Denoising Diffusion Models
AdaMeta: Adaptive Meta-Learning with Dynamic Task Relational Inference for Few-shot learning
NRFP: A Noise-Robust Feature Plugin for Source-Free Domain Adaptation
Label-Agnostic Category Discovery
Learning from Label Proportion with Dual-Proportion Constraints
Test-Time Distillation for Continual Model Adaptation
Another BRIXEL in the Wall: Towards Cheaper Dense Features
Task-Specific Knowledge Improves Generalization: A Logits-Based Framework for Continual Learning of Vision-Language Models
DARN: Dynamic Adaptive Regularization Networks for Efficient and Robust Foundation Model Adaptation
Training-Free Uncertainty-guided Logit Adjustment for Few-Shot Class-Incremental Learning
Model Merging on Loss Landscapes: A Geometric Perspective
DGD: Density Gradient-guided Diffusion for Long-Tailed Clustering
DGP: Dynamic Gradient Projection for Task-Adaptive Continual Learning
Bootstrap Your Own Classifier: Your Pretrained Vision Models are Secretly Strong Continual Learners
Memory-efficient Continual Learning with Prototypical Exemplar Condensation
Continual Adaptation of Vision Foundational Models for Semantic Segmentation in Adverse Weather
ReMem: A Dynamic Memory Evolution Detector for Zero-Shot Anomaly Detection
CurrMix: Curriculum-Enhanced MixUp for Long-Tailed Visual Recognition
Class-Aware Drift Compensation for Non-Uniform Semantic Shift in Continual Learning
Onboarding Without Forgetting: Hypernetwork Personalization with Data-Free Replay for Personalized Federated Learning
FedNPC: Stochastic Noise-driven Post-hoc Classifier Calibration Method for Federated Long-tailed Learning
Learning Multi-Modal Prototypes for Cross-Domain Few-Shot Object Detection
MuSCM: Mutual Spatial Correlation Mapping for Class Incremental Detection Transformer
AFCL: Achieving Spatio-Temporal Invariance to Data Heterogeneity in Federated Continual Learning
SAGA: Semantic Anchor-Guided Alignment for Multi-Source Domain Adaptive Object Detection
DEED: Dual-Channel Enhanced Ensemble Distillation for Uncertainty-Aware Recognition
Wake the Sleeping Weights: Sparsely-Activated Continual Test-Time Adaptation for Medical Image Segmentation
Dynamic Pseudo-Label Assignment and Consistent Prototypical Learning for Few-Shot Class-Incremental Learning
Hold-One-Shot-Out (HOSO) for Validation-Free Few-Shot CLIP Adapters
Learning through Creation: A Hash-Free Framework for On-the-Fly Category Discovery
Frequency-Guided Iterative Bi-directional Exchange Network for Cross-Domain Few-Shot Segmentation
Revisiting Prototype Rehearsal for Exemplar-Free Continual Learning: Manifold-Aware Boundary Sampling with Adaptive Class-Balanced Loss
SCOPE: Spatially Ordered Continual Learning for 3D Segmentation
Learning to Propose Pose for Category-Agnostic Objects via Joint Refinement with Co-Matching Supervision
Is Prompt Selection Necessary for Task-Free Online Continual Learning?
ReConText3D: Replay-based Continual Text-to-3D Generation
Now You See It, Now You Don't: Instant Concept Erasure for Safe Text-to-Image and Video Generation
ECOC-IL: Robust and Efficient Label LDP for Imbalanced Learning
Safe Codebook: Token-Level Moderation for Safer Visual Autoregressive Generation
Towards Universal and Lightweight Coverless Image Steganography with Multimodal Large Language Models Assistance
A Visual Semantic Adaptive Watermark Grounded by Prefix-Tuning for Large Vision-Language Model
TriGuard-FL: A User-Centric Trust Triad in Federated Learning via Auditable Data, Verifiable Contributions, and Antidote-Driven Mitigation
Assessing the Reliability of Image Quality Metrics and Mitigating Quality Bias in Generative Models
Efficient Unlearning through Maximizing Relearning Convergence Delay
Robust Continual Unlearning against Knowledge Erosion and Forgetting Reversal
Memorization In Stable Diffusion Is Unexpectedly Driven by CLIP Embeddings
RAZOR: Ratio-Aware Layer Editing for Targeted Unlearning in Vision Transformers and Diffusion Models
FedOrtho: Efficient Federated Unlearning Via Orthogonal Convolution and Adaptive Soft Pruning
Improving Synthesized Image Detection by Disentangling Generator-Shared and Generator-Specific Image Artifacts
PLR-Gate: Real-Time Gradient Privacy Assessment and Gated Transmission for Secure Federated Learning
A Unified Privacy-Utility Framework for Collaborative Inference via Randomized Smoothing
Verify Claimed Text-to-Image Models Via Boundary-Aware Prompt Optimization
Towards Robust Content Watermarking Against Removal and Forgery Attacks
Revisiting Model Inversion Evaluation: From Misleading Standards to Reliable Privacy Assessment
CBDC: Clean Bias Direction Construction for Unsupervised Debiasing in Vision-Language Models
Erased, But Not Forgotten: Erased Rectified Flow Transformers Still Remain Unsafe Under Concept Attack
Leveraging Unlabeled Data from Unknown Sources via Dual-Path Guidance for Deepfake Face Detection
SEM: Sparse Embedding Modulation for Post-Hoc Debiasing of Vision-Language Models
When Agents Steer Human Perception: How AI-Selected Images Can Covertly Alter Disagreements
UniShield: An Adaptive Multi-Agent Framework for Unified Forgery Image Detection and Localization
On the Group Disparities Arising from Machine Unlearning
Count What Repeats: Period-Adaptive Multi-Scale Consistency for Self-Supervised Repetitive Action Counting
Taming Hallucinations: Boosting MLLMs' Video Understanding via Counterfactual Video Generation
ConfDiff: Confidence-Guided Representation Diffusion for Video Moment Retrieval
Evolutionary Multi-Agent Collaboration for Real-World Video Face Restoration
STS-Mixer: Spatio-Temporal-Spectral Mixer for 4D Point Cloud Video Understanding
HiVid-Narrator: Hierarchical Video Narrative Generation with Scene-Primed ASR-anchored Compression
D^2-STX: Decoupling Spatial-Temporal Cross-Attention for Dual-branch Repetitive Action Counting
Group-DINOmics: Incorporating People Dynamics into DINO for Self-supervised Group Activity Feature Learning
Mamba-VMR: Multimodal Query Augmentation Via Generated Videos for Precise Temporal Grounding
VideoThinker: Building Agentic VideoLLMs with LLM-Guided Tool Reasoning
TP^2-DETR: Unlocking Deformable DETR for Zero-Shot Temporal Action Proposal Generation with Temporal Feature Pyramids
QENN: A Quantum Entanglement-Inspired Neural Network for Interaction and Relationship Prediction in Story Videos
FineGrade: A Rule-Consistent Scoring Framework for Fine-Grained Action Quality Assessment
One Identity, Many Roles: Multimodal Entity Coreference for Enhanced Video Situation Recognition
REBA: Residual Mixture-of-Experts and Bidirectional Video–Text Alignment for Better Fine-grained Weakly Supervised Video Anomaly Detection
ReFoCUS: Reinforcement-guided Frame Optimization for Contextual Understanding
VIDEOP2R: Video Understanding from Perception to Reasoning
Video-R4: Reinforcing Text-Rich Video Reasoning with Visual Rumination
ForestPrune: High-ratio Visual Token Compression for Video Multimodal Large Language Models Via Spatial-Temporal Forest Modeling
HARP: Hierarchical Adaptive Ranking with Probabilistic Modeling for Skill Determination
STORM: End-to-End Referring Multi-Object Tracking in Videos
Extending Segment Anything Model 2 to Multi-Object Tracking by Optimizing Hierarchical Trajectory Memory
NCSTR: Node-Centric Decoupled Spatio-Temporal Reasoning for Video-based Human Pose Estimation
MOSSTrack: Modality-Specific Spatio-Temporal Context Learning for RGB-T Tracking
Temporally Consistent Long-Term Memory for 3D Single Object Tracking
DM^3T: Harmonizing Modalities via Diffusion for Multi-Object Tracking
IRDINO: Adapting DINOv3 with Second-Order Motion Awareness for Moving Infrared Small Target Detection
SemanticMoments: Training-Free Motion Similarity via Third Moment Features
TAPNext++: What’s Next for Tracking Any Point (TAP)?
ARGS: Auto-Regressive Gaussian Splatting via Parallel Progressive Next-Scale Prediction
100Editor: 100+ Views per Batch and Minute-Scale View-Consistent 3D Editing
DIAMOND-SSS: Diffusion-Augmented Multi-View Optimization for Data-efficient SubSurface Scattering
Reason-SVG: Enhancing Structured Reasoning for Vector Graphics Generation with Reinforcement Learning
Harmonized Multi-Layer Text-to-Image Generation with Generative Priors
StabiGS: Video Stabilization through Rendering-Aware Trajectory Optimization in 3DGS-Reconstructed Scenes
More Traces Better: Unified Artifact Modeling for Generalizable and Robust AI-generated Image Detection
Predicting Gene Expression in Spatially Resolved Transcriptomics Across Samples Through Probabilistic Fusion of Hierarchical Histology and Spatial Information
Don't Let the Information Slip Away
FraQAT: Quantization Aware Training with Fractional Bits
MAGIC: Few-Shot Mask-Guided Anomaly Inpainting with Prompt Perturbation, Spatially Adaptive Guidance, and Context Awareness
Video Inspector: An Agentic-RL Framework and Benchmark for Human-Aligned Generative Video Evaluation
From Pixels to Nucleotides: End-to-End Token-Based Video Compression for DNA Storage
CoPS: Conditional Prompt Synthesis for Zero-Shot Anomaly Detection
PSIM: Perceptual Similarity Index Measure
UI-AGILE: Advancing GUI Agents with Effective Reinforcement Learning and Precise Inference-Time Grounding
Beyond Static Artifacts: A Forensic Benchmark for Video Deepfake Reasoning in Vision Language Models
GreenPlanner: Practical Floorplan Layout Generation via an Energy-Aware and Function-Feasible Generative Framework
Dual-Stage Parameter-Efficient Fine-Tuning for Consistent Spatial and Temporal Representation
WideEye: Achieving Wide Field-of-view Traffic Video Analytics With Dynamic Orientation Adaptation
Restore-R1: Efficient Image Restoration Agents via Reinforcement Learning with Multimodal LLM Perceptual Feedback
Pose-dIVE: Pose-Diversified Augmentation for Person Re-Identification
Adaptive Reinforcement for Open-ended Medical Reasoning via Semantic-Guided Reward Collapse Mitigation
BrandFusion: A Multi-Agent Framework for Seamless Brand Integration in Text-to-Video Generation
IEA: Amateur-Friendly Conversational Image Editing Agent via Three Stages of Multitask Alignment
QuPAINT: Physics-Aware Instruction Tuning Approach to Quantum Material Discovery
CLIPtone-GO: Geometry-Aware, Gradient-Orthogonalized Text-Guided Color Tone Adjustment
Generative Digital Twins: Vision-Language Simulation Models for Executable Industrial Systems
Exploiting the Source-Asymmetry Confidence Gap for Generalizable AI-Generated Image Detection
CineMatte: Background Matting for Virtual Production and Beyond
GATE: Gaussian-Attentive Transformer for Uncertainty-Aware Age Estimation
GRAFT: Graph-Based Affordance Transfer via Part Correspondence
Face Time Traveller: Travel Through Ages Without Losing Identity
KGGAT: Knowledge-Guided Graph Attention Network for Multi-Label Image Classification
IntentEdit: Multi-Agent Reasoning for Intent-Driven Complex Image Editing
Gen-n-Val: Agentic Image Data Generation and Validation
SignReasoner: Compositional Reasoning for Complex Traffic Sign Understanding Via Functional Structure Units
DARTS: Distance-Aware Robust Training for Selective Classification
Modulate-and-Map: Crossmodal Feature Mapping with Cross-View Modulation for 3D Anomaly Detection
PestVL-Net: Enabling Multimodal Pest Learning Via Fine-grained Vision-Language Interaction
Plug-and-Play Dynamic In-context Learning with Stochastic Regularization for Screen Content Image Super-Resolution
EscherNet++: A Scalable Multi-View Framework for Amodal Completion, Novel View Synthesis and Feed-Forward 3D Reconstruction
Human-Intervention Segmentation via Federated Intent Embedding and Multi-Mask Recommendation
Di3PO - Diptych Diffusion DPO for Targeted Improvements in Image Generation
Robust Image Self-Recovery against Tampering using Watermark Generation with Pixel Shuffling
Learning to Select, Learning to Judge: Active Preference Alignment for Mars Terrain Segmentation
Attention Never Lie: Visual Attention Defocus Reveals and Rectifies Hallucinations in MLLMs
Organizing Unstructured Image Collections using Natural Language
Thinking with Blueprints: Assisting Vision–Language Models in Spatial Reasoning via Structured Object Representation
Mitigating Object Hallucinations in LVLMs via Attention Imbalance Rectification
Efficient3D: A Unified Framework for Adaptive and Debiased Token Reduction in 3D MLLMs
HiViS: Hiding Visual Tokens from the Drafter for Speculative Decoding in Vision-Language Models
Visual Funnel: Resolving Contextual Blindness in Multimodal Large Language Models
Video Parallel Scaling: Aggregating Diverse Frame Subsets for VideoLLMs
Reasoning for Mobile User Experience with Multimodal LLMs: Task, Benchmark, and Approach
Visual Reasoning Through Tool-Supervised Reinforcement Learning
VSI: Visual–Subtitle Integration for Keyframe Selection to Enhance Long Video Understanding
ProGAL-VLA: Grounded Alignment through Prospective Reasoning in Vision-Language-Action Models
Myopia Rectification: KV Cache Pruning for MLLMs Via Dynamic Attention Subsidy and Token Reclamation
NaiLIA: Multimodal Nail Design Retrieval Based on Dense Intent Descriptions and Palette Queries
Logical Consistency Optimization for Few-Shot Weakly Supervised Video Anomaly Detection
VEGAS: Mitigating Hallucinations in Large Vision-Language Models via Vision-Encoder Attention Guided Adaptive Steering
COOPER: A Unified Model for Cooperative Perception and Reasoning in Spatial Intelligence
CoVCR: Bridging Visual Narrative Gaps via Context Generation for Robust Commonsense Reasoning
MMR1: Enhancing Multimodal Reasoning with Variance-Aware Sampling
Active Video Perception: Iterative Evidence Seeking for Agentic Long Video Understanding
VoQA: Visual-only Question Answering
Benchmarking Vision-Language Models under Contradictory Virtual Content Attacks in Augmented Reality
Language-Augmented Semantic Priors for B-Spline Surface Fitting
Flash-Unified: A Training-Free and Task-Aware Acceleration Framework for Native Unified Models
Trajectory-Diversity-Driven Robust Vision-and-Language Navigation
V-STaR: Benchmarking Video-LLMs on Video Spatio-Temporal Reasoning
Distilling Counterfactual Reasoning from Language to Vision: Causal Graph-Guided Post-Training for Video Understanding
Exploring Physics-aware Video Generation through Reinforcement Learning with Autoregressive Tokens
Are Video Models Ready as Zero-Shot Reasoners? An Empirical Study with the MME-CoF Benchmark
Overthinking Causes Hallucination: Tracing Confounder Propagation in Vision Language Models
GDP: Graph-Based Dynamic Personalization for Multimodal Large Language Models
AnyExperts: On-Demand Expert Allocation for Multimodal Language Models with Mixture of Experts
Scaling Spatial Reasoning in MLLMs through Programmatic Data Synthesis
Reasoning Within the Mind: Dynamic Multimodal Interleaving in Latent Space
RVLF: A Reinforcing Vision–Language Framework for Gloss-Free Sign Language Translation
Weaver: End-to-End Agentic System Training for Video Interleaved Reasoning
AITP: Traffic Accident Responsibility Allocation via Multimodal Large Language Models
Grounding Hierarchical Vision-Language-Action Models Through Explicit Language-Action Alignment
Fine-Grained Visual Prompt and Region Self-Distillation for Retrieval-Augmented VQA
RADSeg: Unleashing Parameter and Compute Efficient Zero-Shot Open-Vocabulary Segmentation Using Agglomerative Models
Modality-Aware Bit Allocation for Mixed-Precision Quantization of Vision-Language Models
Switch-KD: Visual-Switch Knowledge Distillation for Vision-Language Models
Analyzing and Enhancing Visual Learning in LLM-based Radiology Report Generation
DocSLM: A Small Vision-Language Model for Long Multimodal Document Understanding
EchoTrail-GUI: Building Actionable Memory for GUI Agents via Critic-Guided Self-Exploration
Semantic Guided Feature Disentanglement and Reconstruction for Domain Adaptive Object Detection
Dual-Modality Anchor-Guided Filtering for Test-Time Prompt Tuning
Towards Efficient Multimodal Unified Reasoning Model via Model Merging
DeepSketcher: Internalizing Visual Manipulation for Multimodal Reasoning
Can Textual Reasoning Improve the Performance of MLLMs on Fine-Grained Visual Classification?
VRSA: Jailbreaking Multimodal Large Language Models through Visual Reasoning Sequential Attack
StreamEQA: Towards Streaming Video Understanding for Embodied Scenarios
MASS: Motion-Aware Spatial–temporal Grounding for Physics Reasoning and Comprehension in Vision-Language Models
Beyond Syntax: Action Semantics Learning for App Agents
Learning to Select Visual In-Context Demonstrations
CheXmix: Unified Generative Pretraining for Vision Language Models in Medical Imaging
Mull-Tokens: Modality-Agnostic Latent Thinking
SPHINX: A Synthetic Environment for Visual Perception and Reasoning
It’s Time to Get It Right: Improving Analog Clock Reading and Clock-Hand Spatial Reasoning in Vision-Language Models
Uncertainty-Guided Graph Formulation via MWIS for Token Pruning in LVLMs
From Alignment to Reason: Multi-Agent Debate for Tactical Badminton Video Retrieval
Distilling Out-of-Distribution Knowledge from Large Language Models for CLIP Generalization
Multimodal Reasoning with Explicit Reasoning Patterns and Rewards
VRAG-DFD: Verifiable Retrieval-Augmentation for MLLM-based Deepfake Detection
MIRA: Multimodal Iterative Reasoning Agent for Image Editing
Mitigating Visual Context Degradation in Large Multimodal Models: A Training-Free Decoupled Agentic Framework
CodePlot-CoT: Mathematical Visual Reasoning by Thinking with Code-Driven Images
Do All Individual Layers Help? An Empirical Study of Task-Interfering Layers in Vision-Language Model
Recursive Think-Answer Process for LLMs and VLMs
GenSRL: Generative Spatiotemporal Representation Learning for Ophthalmic Prognosis Prediction
Ramen: Robust Test-Time Adaptation of Vision-Language Models with Active Sample Selection
LED: LLM Enhanced Open-Vocabulary Object Detection without Human Curated Data Generation
VSAS-Bench: Real-Time Evaluation of Visual Streaming Assistant Models
Mitigating Vision-Text Order Bias in Vision-Language Model
MARS-RL: Enhancing Multi-Agent RAG Systems for Multi-Modal Documents via Strategic Reasoning with Reinforcement Learning
Beyond Single Object: Learning 3D Relations with Large Language Models
CarePilot: A Multi-Agent Framework for Long-Horizon Computer Task Automation in Healthcare
Attention-Space Contrastive Guidance for Efficient Hallucination Mitigation in LVLMs
UnrealSpace: Analyzing Spatial Understanding and Reasoning in Controllable Simulation
Frequency-Modulated Visual Restoration for Matryoshka Large Multimodal Models
Learning When to Look: A Disentangled Curriculum for Strategic Perception in Multimodal Reasoning
Hierarchical Textual Knowledge for Enhanced Image Clustering
Are Multimodal Large Language Models Ready for Omnidirectional Spatial Reasoning?
Entropy-Based Visual Re-perception Inference for Multimodal Models
VACoT: Rethinking Visual Data Augmentation with VLMs
Open World Image Aesthetic Assessment
coDrawAgents: A Multi-Agent Dialogue Framework for Compositional Image Generation
PosterGen: Aesthetic-Aware Multi-Modal Paper-to-Poster Generation Via Multi-Agent LLMs
Euclid’s Gift: Enhancing Spatial Perception and Reasoning in Vision-Language Models via Geometric Surrogate Tasks
Why MLLMs Struggle to Determine Object Orientations
VADE: Variance-Aware Dynamic Sampling via Online Sample-Level Difficulty Estimation for Multimodal Reinforcement Learning
Less is More: Token-Efficient Video-QA via Adaptive Frame-Pruning and Semantic Graph Integration
Alleviating Hallucinations in Large Vision-Language Models via Decoding-Time Perturbation Adaptation
RISE: Enhancing VLM Image Annotation with Self-Supervised Reasoning