CVPR 2026 Events with Videos
Posters
- PrivateEyes: Gaze-Preserving Anonymization for Data Sharing
- Beyond Matching to Tiles: Bridging Unaligned Aerial and Satellite Views for Vision-Only UAV Navigation
- X-WIN: Building Chest Radiograph World Model via Predictive Sensing
- Feed-Forward One-Shot Animatable Textured Mesh Avatar Reconstruction
- Generating Humanless Environment Walkthroughs from Egocentric Walking Tour Videos
- Heterogeneous Decentralized Diffusion Models
- Data-Centric Meta-Learning for Robust Few-Shot Generalization
- Are We Ready for RL in Text-to-3D Generation? A Progressive Investigation
- Olbedo: An Albedo and Shading Aerial Dataset for Large-Scale Outdoor Environments
- AD-GBC: Anisotropic Granular-Ball Skip-Connection Refiner for UNet-Based Medical Image Segmentation
- TextOVSR: Text-Guided Real-World Opera Video Super-Resolution
- DextER: Language-driven Dexterous Grasp Generation with Embodied Reasoning
- Proof-of-Perception: Certified Tool-Using Multimodal Reasoning with Compositional Conformal Guarantees
- Content-Aware Frequency Encoding for Implicit Neural Representations with Fourier-Chebyshev Features
- Curriculum Group Policy Optimization: Adaptive Sampling for Unleashing the Potential of Text-to-Image Generation
- Training One Model to Master Cross-Level Agentic Actions via Reinforcement Learning
- GDPO-SR: Group Direct Preference Optimization for One-Step Generative Image Super-Resolution
- β-CLIP: Text-Conditioned Contrastive Learning for Multi-Granular Vision-Language Alignment
- Chain of World: World Model Thinking in Latent Motion
- Beyond Euclidean Gossip: KL-Barycentric Consensus on Heterogeneous and Imbalanced Images
- FAVE: A Structured Benchmark for Fine-Grained Audio-Visual Temporal Evaluation in Multimodal LLMs
- Measure The Feature Universe: Topology-based Pseudo Labeling and Gravity Consistency for Source-Free Domain Adaptation
- RnG: A Unified Transformer for Complete 3D Modeling from Partial Observations
- SGDrive: Scene-to-Goal Hierarchical World Cognition for Autonomous Driving
- Computation and Communication Efficient Federated Unlearning via On-server Gradient Conflict Mitigation and Expression
- BHCast: Unlocking Black Hole Plasma Dynamics from a Single Blurry Image with Long-Term Forecasting
- Vision-Speech Models: Teaching Speech Models to Converse about Images
- LaSM: Layer-wise Scaling Mechanism for Defending Pop-up Attack on GUI Agents
- MuCo: Multi-turn Contrastive Learning for Multimodal Embedding Model
- Unleashing the Intrinsic Visual Representation Capability of Multimodal Large Language Models
- FoV-Net: Rotation-Invariant CAD B-rep Learning via Field-of-View Ray Casting
- Neural Distribution Prior for LiDAR Out-of-Distribution Detection
- MonoSAOD: Monocular 3D Object Detection with Sparsely Annotated Label
- Affordance-First Decomposition for Continual Learning in Video–Language Understanding
- EgoControl: Controllable Egocentric Video Generation via 3D Full-Body Poses
- PhysX-Anything: Simulation-Ready Physical 3D Assets from Single Image
- RevINN: An End-to-End Invertible Neural Network for Reversible Adversarial Examples Generation
- S2C2Seg: Semantic-Spatial Consistency and Category Optimization for Open-Vocabulary Segmentation
- JRM: Joint Reconstruction Model for Multiple Objects without Alignment
- Video2Robo: 3DGS-based Synthetic Data from One Video Enables Scalable Robot Learning
- Variation-aware Vision Token Dropping for Faster Large Vision-Language Models
- A Debiased Reconstruction-based Framework for Training-Free Detection of AI-Generated Images
- Momentum Memory for Knowledge Distillation in Computational Pathology
- STAGE: Storyboard-Anchored Generation for Cinematic Multi-shot Narrative
- HySeg: Learning Generative Priors for Structure-Aware Remote Sensing Segmentation
- LDP-Slicing: Local Differential Privacy for Images via Randomized Bit-Plane Slicing
- Fast-ThinkAct: Efficient Vision-Language-Action Reasoning via Verbalizable Latent Planning
- ConceptPrism: Concept Disentanglement in Personalized Diffusion Models via Residual Token Optimization
- MD2E: Modeling Depth-to-Edge Cues for Monocular Metric Depth Estimation
- Fast SceneScript: Fast and Accurate Language‑Based 3D Scene Understanding via Multi‑Token Prediction
- X-AVDT: Audio-Visual Cross-Attention for Robust Deepfake Detection
- The Missing GAP: From Solving Square Jigsaw Puzzles to Handling Real World Archaeological Fragments
- Don’t Show Pixels, Show Cues: Unlocking Visual Tool Reasoning in Language Models via Perception Programs
- RealBirdID: Benchmarking Bird Species Identification in the Era of MLLMs
- Material Magic Wand: Material-Aware Grouping of 3D Parts in Untextured Meshes
- PRISM: Learning a Shared Primitive Space for Transferable Skeleton Action Representation
- UnReflectAnything: RGB-Only Highlight Removal by Rendering Synthetic Specular Supervision
- WikiCLIP: An Efficient Contrastive Baseline for Open-domain Visual Entity Recognition
- MedGRPO: Multi-Task Reinforcement Learning for Heterogeneous Medical Video Understanding
- Curvature-Aware Zeroth-Order Optimization for Memory-Efficient Test-Time Adaptation
- Scalable Trajectory Generation for Whole-Body Mobile Manipulation
- Wanderland: Geometrically Grounded Simulation for Open-World Embodied AI
- World in a Frame: Understanding Culture Mixing as a New Challenge for Vision-Language Models
- Think with 3D: Geometric Imagination Grounded Spatial Reasoning from Limited Views
- Neural Dynamic GI: Random-Access Neural Compression for Temporal Lightmaps in Dynamic Lighting Environments
- Calibri: Enhancing Diffusion Transformers via Parameter-Efficient Calibration
- Real-Time Multimodal Fingertip Contact Detection via Depth and Motion Fusion for Vision-Based Human–Computer Interaction
- ARGUS: Defending Against Multimodal Indirect Prompt Injection via Steering Instruction-Following Behavior
- Concept Regions Matter: Benchmarking CLIP with a New Cluster-Importance Approach
- Differences That Matter: Auditing Models for Capability Gap Discovery and Rectification
- MoVieS: Motion-Aware 4D Dynamic View Synthesis in One Second
- SAIL: Similarity-Aware Guidance and Inter-Caption Augmentation-based Learning for Weakly-Supervised Dense Video Captioning
- EffectErase: Joint Video Object Removal and Insertion for High-Quality Effect Erasing
- View-Aware Semantic Alignment for Aerial-Ground Person Re-Identification
- Neural Gabor Splatting: Enhanced Gaussian Splatting with Neural Gabor for High-frequency Surface Reconstruction
- PatchAlign3D: Local Feature Alignment for Dense 3D Shape Understanding
- SSM-Aware Token-Efficient VMamba via Adaptive Patch Pruning and Merging for Person Re-Identification
- GenErase: Generalizable and Semantically-Aware Concept Erasure in Diffusion Models
- FontCrafter: High-Fidelity Element-Driven Artistic Font Creation with Visual In-Context Generation
- MotionEnhancer: Leveraging Video Diffusion for Motion-Enhanced Vision-Language Models
- Recovering Physically Plausible Human-Object Interactions from Monocular Videos
- Not All Birds Look The Same: Identity-Preserving Generation For Birds
- One Patch to Caption Them All: A Unified Zero-Shot Captioning Framework
- CraftMesh: High-Fidelity Generative Mesh Manipulation via Poisson Seamless Fusion
- Are Image-to-Video Models Good Zero-Shot Image Editors?
- CapNav: Benchmarking Vision Language Models on Capability-conditioned Indoor Navigation
- Nonparametric Deep Fine-grained Clustering with Low-Rank Guided Vision-Language Model
- SpatialReward: Verifiable Spatial Reward Modeling for Fine-Grained Spatial Consistency in Text-to-Image Generation
- Pano3DComposer: Feed-Forward Compositional 3D Scene Generation from Single Panoramic Image
- DICArt: Advancing Category-level Articulated Object Pose Estimation in Discrete State-Spaces
- Harnessing the Power of Foundation Models for Accurate Material Classification
- Preference-Aligned LoRA Merging: Preserving Subspace Coverage and Addressing Directional Anisotropy
- PiLoT: Neural Pixel-to-3D Registration for UAV-based Ego and Target Geo-localization
- ReSAM: Refine, Requery, and Reinforce: Self-Prompting Point-Supervised Segmentation for Remote Sensing Images
- Teacher-Guided Routing for Sparse Vision Mixture-of-Experts
- RMIR: A Benchmark Dataset for Reasoning-Intensive Multimodal Image Retrieval
- PhysHead: Simulation-Ready Gaussian Head Avatars
- Act2See: Emergent Active Visual Perception for Video Reasoning
- Hugging Visual Prompt and Segmentation Tokens: Consistency Learning for Fine-Grained Visual Understanding in MLLMs
- Bidirectional Normalizing Flow: From Data to Noise and Back
- AXG-Reasoner: Error Detection and Explanation in Long Task Videos with Vision–Language Models
- DualSplat: Robust 3D Gaussian Splatting via Pseudo-Mask Bootstrapping from Reconstruction Failures
- Beyond Patches: Global-aware Autoregressive Model for Multimodal Few-Shot Font Generation
- Lifting Unlabeled Internet-level Data for 3D Scene Understanding
- MOSAIC-GS: Monocular Scene Reconstruction via Advanced Initialization for Complex Dynamic Environments
- Render-to-Adapt: Unsupervised Personal Adaptation for Gaze Estimation
- The Surprising Effectiveness of Noise Pretraining for Implicit Neural Representations
- Unblur-SLAM: Dense Neural SLAM for Blurry Inputs
- LIFT and PLACE: A Simple, Stable, and Effective Knowledge Distillation Framework for Lightweight Diffusion Models
- Bootstrapping Multi-view Learning for Test-time Noisy Correspondence
- C-GenReg: Training-Free 3D Point Cloud Registration by Multi-View-Consistent Geometry-to-Image Generation with Probabilistic Modalities Fusion
- Probing and Bridging Geometry–Interaction Cues for Affordance Reasoning in Vision Foundation Models
- ManifoldNeuS: Manifold-aware View Optimizability for Pose-Free Neural Surface Reconstruction
- TeHOR: Text-Guided 3D Human and Object Reconstruction with Textures
- Language-Free Generative Editing from One Visual Example
- VIRST: Video-Instructed Reasoning Assistant for SpatioTemporal Segmentation
- Breaking Spurious Correlations: Uncertainty-Driven Causal Transformers for AU Detection
- DABO: Difficulty-Aware Bayesian Optimization with Diffusion-Learned Priors
- IR-HGP: Physically-Aware Gaussian Inverse Rendering for High-Illumination Scenes via Generative Priors
- Bias In, Bias Out? Finding Unbiased Subnetworks in Vanilla Models
- Unified Spherical Frontend: Learning Rotation-Equivariant Representations of Spherical Images from Any Camera
- Query2Uncertainty: Robust Uncertainty Quantification and Calibration for 3D Object Detection under Distribution Shift
- CG-Reasoner: Centroid-Guided Positional Reasoning Segmentation for Medical Imaging with a Robust Visual-Text Consistency Metric
- Understanding Counting Mechanisms in Large Language and Vision-Language Models
- VGA: Empowering Aerial-Ground Localization by Visual Geometry Alignment
- Adaptive Spatial-Temporal Window: Unlocking the Potential of Event Cameras in Heterogeneous Velocity Scenarios
- Yo'City: Personalized and Boundless 3D Realistic City Scene Generation via Self-Critic Expansion
- iSHIFT: Lightweight Slow-Fast GUI Agent with Adaptive Perception
- S^2-MLLM: Boosting Spatial Reasoning Capability of MLLMs for 3D Visual Grounding with Structural Guidance
- Towards Knowledge-augmented Bayesian Deep Learning For Computer Vision
- Does YOLO Really Need to See Every Training Image in Every Epoch?
- MOFA-VTON: More Fashion Possibilities with Fine-Grained Adaptations in Virtual Try-On
- Video Generation with Stable Transparency via Shiftable RGB-A Distribution Learner
- HandVQA: Diagnosing and Improving Fine-Grained Spatial Reasoning about Hands in Vision-Language Models
- Diffusion-Based Makeup Transfer with Facial Region-Aware Makeup Features
- Interpretable and Steerable Concept Bottleneck Sparse Autoencoders
- DirectFisheye-GS: Enabling Native Fisheye Input in Gaussian Splatting with Cross-View Joint Optimization
- Long-Tail Internet Photo Reconstruction
- Learning Compact 3D Representations from Feed-Forward Novel View Synthesis
- Asynchronous Temporal Modeling with Two-Agent Framework for Streaming Dense Video Captioning
- fMRI-LM: Towards a Universal Foundation Model for Language-Aligned fMRI Understanding
- From Weights to Concepts: Data-Free Interpretability of CLIP via Singular Vector Decomposition
- Efficient Real-Time Raw-to-Raw Denoising for Extreme Low-Light Ultra HD Video on Mobile Devices
- REArtGS++: Generalizable Articulation Reconstruction with Temporal Geometry Constraint via Planar Gaussian Splatting
- DENALI: A Dataset Enabling Non-Line-of-Sight Spatial Reasoning with Low-Cost LiDARs
- Cluster-Aware Neural Collapse Prompt Tuning for Long-Tailed Generalization of Vision-Language Models
- A Combination of Noise and Bilateral Filters Achieve Supralinear and Scalable Adversarial Robustness in CNNs
- When to Think and When to Look: Uncertainty-Guided Lookback
- HandX: Scaling Bimanual Motion and Interaction Generation
- rPPG-VQA: A Video Quality Assessment Framework for Unsupervised rPPG Training
- Dynamic-eDiTor: Training-Free Text-Driven 4D Scene Editing with Multimodal Diffusion Transformer
- Synergistic Bleeding Region and Point Detection in Laparoscopic Surgical Videos
- MedCLIPSeg: Probabilistic Vision-Language Adaptation for Data-Efficient and Generalizable Medical Image Segmentation
- UNI-OOD: Unified Object- and Image-level Out-of-Distribution Detection via Cross-Context Attentive Vision-Language Modeling
- HoloCine: Holistic Generation of Cinematic Multi-Shot Long Video Narratives
- Spatial-Frequency Collaborative Learning for Occluded Visible-Infrared Person Re-Identification
- CausalLens: Sensitivity-Guided Multi-Head Causal Intervention for Hallucination Mitigation in Large Vision-Language Models
- Do You Have Freestyle? Expressive Humanoid Locomotion via Audio Control
- Incentivizing Generative Zero-Shot Learning via Outcome-Reward Reinforcement Learning with Visual Cues
- GuideFlow: Constraint-Guided Flow Matching for Planning in End-to-End Autonomous Driving
- Beyond Top Activations: Efficient and Reliable Crowdsourced Evaluation of Automated Interpretability
- HERBench: A Benchmark for Multi-Evidence Integration in Video Question Answering
- Cross-modal Identity Mapping: Minimizing Information Loss in Modality Conversion via Reinforcement Learning
- OMG-Bench: A New Challenging Benchmark for Skeleton-based Online Micro Hand Gesture Recognition
- DisCa: Accelerating Video Diffusion Transformers with Distillation-Compatible Learnable Feature Caching
- Blink: Dynamic Visual Token Resolution for Enhanced Multimodal Understanding
- 3D-Aware Implicit Motion Control for View-Adaptive Human Video Generation
- VesMamba: 3D Pulmonary Vessel Segmentation from CT images via Mamba with Structural Perception and Scale-aware Filtering
- Graph2Eval: Automatic Multimodal Task Generation for Agents via Knowledge Graphs
- MovieRecapsQA: A Multimodal Open-Ended Video Question-Answering Benchmark
- Learning Personalized Photographic Style from Pairwise User Preferences
- ImageRAGTurbo: Towards One-step Text-to-Image Generation with Retrieval-Augmented Diffusion Models
- Hierarchical Action Learning for Weakly-Supervised Action Segmentation
- Geometric Neural Distance Fields for Learning Human Motion Priors
- OmniSonic: Towards Universal and Holistic Audio Generation from Video and Text
- RawMetaDiff: Unlocking Extreme Darkness from Dual-Exposure RAW with Meta-Guided Diffusion
- Rethinking Dataset Distillation: Hard Truths about Soft Labels
- From Detection to Association: Learning Discriminative Object Embeddings for Multi-Object Tracking
- VideoMaMa: Mask-Guided Video Matting via Generative Prior
- Reinforcing Video Reasoning Segmentation to Think Before It Segments
- Ultra Diffusion Poser: Diffusion-Based Human Motion Tracking from Sparse Inertial Sensors and Ranging-based Between-sensor Distances
- SGSoft: Learning Fused Semantic-Geometric Features for 3D Shape Correspondence via Template-Guided Soft Signals
- How Much 3D Do Video Foundation Models Encode?
- Efficient Weighted Sampling via Score-based Generative Models
- Region-Adaptive Sampling for Diffusion Transformers
- Transition Matching Distillation for Fast Video Generation
- Correspondence-Attention Alignment for Multi-View Diffusion Models
- Beyond Duality: A Hybrid Framework of Leveraging Shared and Private Features for RGB-Event Object Detection
- StaR-KVQA: Structured Reasoning Traces for Implicit-Knowledge Visual Question Answering
- FoleyDesigner: Immersive Stereo Foley Generation with Precise Spatio-Temporal Alignment for Film Clips
- Rel-Zero: Harnessing Patch-Pair Invariance for Robust Zero-Watermarking Against AI Editing
- Mind the Gap: Transferring Labels to Align Object Detection Datasets
- Vanast: Virtual Try-On with Human Image Animation via Synthetic Triplet Supervision
- NOWA: Null-space Optical Watermark for Invisible Capture Fingerprinting and Tamper Localization
- 3D-Aware Multi-Task Learning with Cross-View Correlations for Dense Scene Understanding
- MMFace-DiT: A Dual-Stream Diffusion Transformer for High-Fidelity Multimodal Face Generation
- Models as Lego Builders: Assembling Malice from Benign Blocks via Semantic Blueprints
- VideoSeek: Long-Horizon Video Agent with Tool-Guided Seeking
- FALCON: False-Negative Aware Learning of Contrastive Negatives in Vision-Language Alignment
- AffordMatcher: Affordance Learning in 3D Scenes from Visual Signifiers
- Evidential Transformation Network: Turning Pretrained Models into Evidential Models for Post-hoc Uncertainty Estimation
- LiDeRe: A Lightweight Readout for Fast and Data-Efficient Dense Prediction
- EmoStyle: Emotion-Driven Image Stylization
- A Style is Worth One Code: Unlocking Code-to-Style Image Generation with Discrete Style Space
- Breaking the Illusion: When Positive Meets Negative in Multimodal Decoding
- EG-3DVG: Expression and Geometry Aware Grounding Decoder for 3D Visual Grounding
- NexusFlow: Unifying Disparate Tasks under Partial Supervision via Invertible Flow Networks
- Dual Band Thermal Videography: Separating Time-Varying Reflection and Emission Near Ambient Conditions
- LLM-Guided Probabilistic Fusion for Label-Efficient Document Layout Analysis
- DGGT: Feedforward 4D Reconstruction of Dynamic Driving Scenes using Unposed Images
- From Observation to Action: Latent Action-based Primitive Segmentation for VLA Pre-training in Industrial Settings
- LNEM: Lunar Neural Elevation Model
- GenMatter: Perceiving Physical Objects with Generative Matter Models
- Physical Simulator In-the-Loop Video Generation
- Learning Multi-View Spatial Reasoning from Cross-View Relations
- FlashDecoder: Real-Time Latent-to-Pixel Streaming Decoder with Transformers
- Unsafe2Safe: Controllable Image Anonymization for Downstream Utility
- EmoThinker: Advancing Visual-Acoustic Emotion Analysis via Structural Token Selection and Chain-of-Thought Reasoning
- ReflexSplit: Single Image Reflection Separation via Layer Fusion-Separation
- Scalable Feature Matching via State Space Modeling and Sparse Correlation
- InstAP: Instance-Aware Vision-Language Pre-Train for Spatial-Temporal Understanding
- NESTOR: A Nested MOE-based Neural Operator for Large-Scale PDE Pre-Training
- DrivePI: Spatial-aware 4D MLLM for Unified Autonomous Driving Understanding, Perception, Prediction and Planning
- Functional Mean Flow in Hilbert Space
- MaskAdapt: Learning Flexible Motion Adaptation via Mask-Invariant Prior for Physics-Based Characters
- MoCapAnything: Unified 3D Motion Capture for Arbitrary Skeletons from Monocular Videos
- Fighting Hallucinations with Counterfactuals: Diffusion-Guided Perturbations for LVLM Hallucination Suppression
- FlowPortal: Residual-Corrected Flow for Training-Free Video Relighting and Background Replacement
- MRD: Multi-resolution Retrieval-Detection Fusion for High-Resolution Image Understanding
- E2EGS: Event-to-Edge Gaussian Splatting for Pose-Free 3D Reconstruction
- Stay in your Lane: Role Specific Queries with Overlap Suppression Loss for Dense Video Captioning
- Exploring Spatial Intelligence from a Generative Perspective
- Ultrasound-CLIP: Semantic-Aware Contrastive Pre-training for Ultrasound Image-Text Understanding
- Revisiting Geometric Obfuscation with Dual Convergent Lines for Privacy-Preserving Image Queries in Visual Localization
- SketchDeco: Training-Free Latent Composition for Precise Sketch Colourisation
- Quantized Residuals to Continuous Prompts for Few-Shot Class Incremental Learning in Vision-Language Models
- Perception Characteristics Distance: Measuring Stability and Robustness of Perception System in Dynamic Conditions under a Certain Decision Rule
- X-band Radar Non-Line-of-Sight Imaging
- TWEO: Transformers Without Extreme Outliers Enables FP8 Training And Quantization For Dummies
- Conditional Factuality Controlled LLMs with Generalization Certificates via Conformal Sampling
- ShotDirector: Directorially Controllable Multi-Shot Video Generation with Cinematographic Transitions
- CompBench: Benchmarking Complex Instruction-guided Image Editing
- Seeing the Scene Matters: Revealing Forgetting in Video Understanding Models with a Scene-Aware Long-Video Benchmark
- Sparsity-Aware Voxel Attention and Foreground Modulation for 3D Semantic Scene Completion
- Shedding Light on VLN Robustness: A Black-box Framework for Indoor Lighting-based Adversarial Attack
- LLMind: Bio-inspired Training-free Adaptive Visual Representations for Vision-Language Models
- AutoTraces: Autoregressive Trajectory Forecasting via Multimodal Large Language Models
- Shoe Style-Invariant and Ground-Aware Learning for Dense Foot Contact Estimation
- Active Perceptual Inference: A Corticothalamic-Inspired Dynamic Nested Recurrent Network for Multimodal Sentiment Analysis with Incomplete Data
- TeFlow: Enabling Multi-frame Supervision for Self-Supervised Feed-forward Scene Flow Estimation
- Advancing Image Classification with Discrete Diffusion Classification Modeling
- Speeding Up the Learning of 3D Gaussians with Much Shorter Gaussian Lists
- Beyond Depth: Evaluating the Width-centric Reasoning Capability of MLLMs
- VibeToken: Scaling 1D Image Tokenizers and Autoregressive Models for Dynamic Resolution Generations
- FlexTraj: Image-to-Video Generation with Flexible Point Trajectory Control
- Inferring Compositional 4D Scenes without Ever Seeing One
- Consistent Instance Field for Dynamic Scene Understanding
- Refaçade: Editing Object with Given Reference Texture
- PromptMoE: A Segmentation Refinement Framework Leveraging Mixture of Experts for Improved Prompting
- SV-GS: Sparse View 4D Reconstruction with Skeleton-Driven Gaussian Splatting
- DeepProtect: Proactive Face-Swapping Defense using Identity Blending and Attribute Distortion
- Stabilizing Feature Geometry in Noisy Pretrained Models for Robust Downstream Tasks
- Bidirectional Query-Driven Generation of Parametric CAD Sketch
- VGGT-Det: Mining VGGT Internal Priors for Sensor-Geometry-Free Multi-View Indoor 3D Object Detection
- ConsID-Gen: View-Consistent and Identity-Preserving Image-to-Video Generation
- Bidirectional Cross-Modal Prompting for Event-Frame Asymmetric Stereo
- DemoFunGrasp: Universal Dexterous Functional Grasping via Demonstration-Editing Reinforcement Learning
- SAIDO: Generalizable Detection of AI-Generated Images via Scene-Aware and Importance-Guided Dynamic Optimization in Continual Learning
- ORBIT: Benchmarking SfM in the Wild with 360° Video
- NimbusGS: Unified 3D Scene Reconstruction under Hybrid Weather
- FedDAP: Domain-Aware Prototype Learning for Federated Learning under Domain Shift
- MedMO: Grounding and Understanding Multimodal Large Language Model for Medical Images
- Ar2Can: An Architect and an Artist Leveraging a Canvas for Multi-Human Generation
- Emergent Outlier View Rejection in Visual Geometry Grounded Transformers
- Neighbor GRPO: Contrastive ODE Policy Optimization Aligns Flow Models
- Time Without Time: Pseudo-Temporal Representation for Space-Time Super-Resolution
- No Labels, No Look-Ahead: Unsupervised Online Video Stabilization with Classical Priors
- HanDyVQA: A Video QA Benchmark for Fine-Grained Hand-Object Interaction Dynamics
- Training-free, Perceptually Consistent Low-Resolution Previews with High-Resolution Image for Efficient Workflows of Diffusion Models
- Source Models Leak What They Shouldn’t: Unlearning Zero-Shot Transfer in Domain Adaptation Through Adversarial Optimization
- LuxRemix: Lighting Decomposition and Remixing for Indoor Scenes
- Meta-Learning In-Context Enables Training-Free Cross Subject Brain Decoding
- GFRRN: Explore the Gaps in Single Image Reflection Removal
- DynamicTree: Interactive Real Tree Animation via Sparse Voxel Spectrum
- VoDaSuRe: A Large-Scale Dataset Revealing Domain Shift in Volumetric Super-Resolution
- Uncertainty-driven 3D Gaussian Splatting Active Mapping via Anisotropic Visibility Field
- PercHead: Perceptual Head Model for Single-Image 3D Head Reconstruction & Editing
- Beyond Heuristic Prompting: A Concept-Guided Bayesian Framework for Zero-Shot Image Recognition
- Learning to Act Robustly with View-Invariant Latent Actions
- Decoupled Generative Modeling for Human-Object Interaction Synthesis
- Multi-Scale Local Speculative Decoding for Image Generation
- Hierarchical Concept Embedding & Pursuit for Interpretable Image Classification
- Thinking Diffusion: Penalize and Guide Visual-Grounded Reasoning in Diffusion Multimodal Language Models
- HyperST: Hierarchical Hyperbolic Learning for Spatial Transcriptomics Prediction
- Intra-class Distribution-guided Generative Hashing with Neighbor Refinement for Cross-modal Retrieval
- Instruction-Guided Lesion Segmentation for Chest X-rays with Automatically Generated Large-Scale Dataset
- Seeing through boxes: Non-Line-of-Sight 3D Reconstruction from Radar Signals
- The Image as Its Own Reward: Reinforcement Learning with Adversarial Reward for Image Generation
- Gastric-X: A Multimodal Multi-Phase Benchmark Dataset for Advancing Vision-Language Models in Gastric Cancer Analysis
- EgoFlow: Gradient-Guided Flow Matching for Egocentric 6DoF Object Motion Generation
- AwareVLN: Reasoning with Self-awareness for Vision-Language Navigation
- Twin-T & TwintVQA: A Reliable Structure–Detail Separating VLM and a Comprehensive Benchmark for Chart and Table Tasks
- VISTA: A Test-Time Self-Improving Video Generation Agent
- Refracting Reality: Generating Images with Realistic Transparent Objects
- Efficient All-Pairs Correlation Volume Sampling for Optical Flow Estimation
- GeniNav: Generative Model Driven Image-Goal Navigation via Imagination-Guided Consistency Flow Matching
- Agile Deliberation: Concept Deliberation for Subjective Visual Classification
- PixARMesh: Autoregressive Mesh-Native Single-View Scene Reconstruction
- IF-Prune: Information-Flow Guided Token Pruning for Efficient Vision-Language Models
- AnyPcc: Compressing Any Point Cloud with a Single Universal Model
- ViT^3: Unlocking Test-Time Training in Vision
- Lifelong Imitation Learning with Multimodal Latent Replay and Incremental Adjustment
- OddGridBench: Exposing the Lack of Fine-Grained Visual Discrepancy Sensitivity in Multimodal Large Language Models
- Act Like a Pathologist: Tissue-Aware Whole Slide Image Reasoning
- OSA: Echocardiography Video Segmentation via Orthogonalized State Update and Anatomical Prior-aware Feature Enhancement
- The Devil Is in Gradient Entanglement: Energy-Aware Gradient Coordinator for Robust Generalized Category Discovery
- MMVIP: A Visible-infrared Paired Dataset for Multi-weather Marine Vision
- Downscaling Intelligence: Exploring Perception and Reasoning Bottlenecks in Small Multimodal Models
- Global Underwater Geolocation from Time-Lapse Polarization Imagery
- Data Leakage Detection and De-duplication in Large Scale Geospatial Image Datasets
- VES-RFT: Rewarding Visual Evidence Sensitivity to Mitigate Hallucinations in Large Vision–Language Models
- UniSpector: Towards Universal Open-set Defect Recognition via Spectral-Contrastive Visual Prompting
- ProjFlow: Projection Sampling with Flow Matching for Zero‑Shot Exact Spatial Motion Control
- Plant Taxonomy Meets Plant Counting: A Fine-Grained, Taxonomic Dataset for Counting Hundreds of Plant Species
- Prime Once, then Reprogram Locally: An Efficient Alternative to Black-Box Service Model Adaptation
- Cross-Modal Emotion Transfer for Emotion Editing in Talking Face Video
- InternData-A1: Pioneering High-Fidelity Synthetic Data for Pre-training Generalist Policy
- Efficient Training for Human Video Generation with Entropy-Guided Prioritized Progressive Learning
- Exact-GS: Mathematically Rigorous and Accurate 3D Gaussian Splatting for 3D X-ray Reconstruction
- PAUL: Uncertainty-Guided Partition and Augmentation for Robust Cross-View Geo-Localization under Noisy Correspondence
- Generalized-CVO: Fast and Correspondence-Free Local Point Cloud Registration with Second Order Riemannian Optimization
- Concept-Aware Batch Sampling Improves Language-Image Pretraining
- FaithFusion: Harmonizing Reconstruction and Generation via Pixel-wise Information Gain
- Event-based Motion Deblurring with Unpaired Data
- Envisioning the Future, One Step at a Time
- GeoFlow: Real-Time Fine-Grained Cross-View Geolocalization via Iterative Flow Prediction
- Globscope: Toward a Global View of the Loss Landscape
- Global Information Thresholding for Sufficient and Necessary Circuits
- QuietPrune: Query-Guided Early Token Pruning for Vision-Language Models
- RHCNet: Residual-Guided Hierarchical Calibration Network for Robust Underwater Object Detection
- MultiBanana: A Challenging Benchmark for Multi-Reference Text-to-Image Generation
- Towards Foundation Models for 3D Scene Understanding: Instance-Aware Self-Supervised Learning for Point Clouds
- Physical Object Understanding with a Physically Controllable World Model
- Authorize-on-Demand: Dynamic Authorization with Legality-Aware Intellectual Property Protection for VLMs
- A Geometric Algebra-Informed 3DGS Framework for Wireless Channel Prediction
- RaGS: Unleashing 3D Gaussian Splatting from 4D Radar and Monocular Cue for 3D Object Detection
- Event Structural Valley: A Unified Theoretical and Practical Framework for Event Camera Autofocus
- A Training-Free Style-Personalization via SVD-Based Feature Decomposition
- Spectrally Distilled Representations Aligned with Instruction-Augmented LLMs for Satellite Imagery
- Extend3D: Town-Scale 3D Generation
- Label-Free Cross-Task LoRA Merging with Null-Space Compression
- VAD-GS: Visibility-Aware Densification for 3D Gaussian Splatting in Dynamic Urban Scenes
- Heuristic Self-Paced Learning for Domain Adaptive Semantic Segmentation under Adverse Conditions
- Glove2Hand: Synthesizing Natural Hand-Object Interaction from Multi-Modal Sensing Gloves
- Reflection Separation from a Single Image via Joint Latent Diffusion
- BiGMINT: Biologically-guided Hierarchical Multimodal Integration for Modeling Multiple Compound Activities in Drug Discovery
- UniPR: Unified Object-level Real-to-Sim Perception and Reconstruction from a Single Stereo Pair
- QueryMe: Query-Driven Open-Vocabulary 3D Object Affordances Grounding from Multimodal Evidence
- WaDi: Weight Direction-aware Distillation for One-step Image Synthesis
- UnityVideo: Unified Multi-Modal Multi-Task Learning for Enhancing World-Aware Video Generation
- Cross-Instance Gaussian Splatting Registration via Geometry-Aware Feature-Guided Alignment
- The Midas Touch for Metric Depth
- SoC: Semantic Orthogonal Calibration for Test-Time Prompt Tuning
- TEAR: Temporal-aware Automated Red-teaming for Text-to-Video Models
- Divide, Conquer, and Aggregate: Asymmetric Experts for Class-Imbalanced Semi-Supervised Medical Image Segmentation
- HumanBA: Human-Aware Bundle Adjustment via Global Human-Camera Decoupling
- Out of Sight, Out of Track: Adversarial Attacks on Propagation-based Multi-Object Trackers via Query State Manipulation
- Spatio-Temporal Conditional Denoising Transformer for Modality-Missing RGBT Tracking
- HandDreamer: Zero-Shot Text to 3D Hand Model Generation using Corrective Hand Shape Guidance
- GlyphPrinter: Region-Grouped Direct Preference Optimization for Glyph-Accurate Visual Text Rendering
- VL-RouterBench: A Benchmark for Vision–Language Model Routing
- STARFlow-V: End-to-End Video Generative Modeling with Autoregressive Normalizing Flows
- GT-SVJ: Generative-Transformer-Based Self-Supervised Video Judge For Efficient Video Reward Modeling
- Gyro-based Deep Video Deblurring
- DeepAlign: Mitigating Modality Conflict through Modality-Specific Alignment
- Planning in 8 Tokens: A Compact Discrete Tokenizer for Latent World Model
- MSPT: Efficient Large-Scale Physical Modeling via Parallelized Multi-Scale Attention
- COG: Confidence-aware Optimal Geometric Correspondence for Unsupervised Single-reference Novel Object Pose Estimation
- Differentiable Adaptive 4D Structured Illumination for Joint Capture of Shape and Reflectance
- PinPoint: Evaluation of Composed Image Retrieval with Explicit Negatives, Multi-Image Queries, and Paraphrase Testing
- Splatent: Splatting Diffusion Latents for Novel View Synthesis
- AvatarPointillist: AutoRegressive 4D Gaussian Avatarization
- Universal 3D Shape Matching via Coarse-to-Fine Language Guidance
- SeeU: Seeing the Unseen World via 4D Dynamics-aware Generation
- Tackling Alignment Ambiguity in Person Retrieval through Conversational Attribute Mining
- SyncMos: Scalable Motion Synchronisation for Multi-Agent Scene Interaction
- CLAY: Conditional Visual Similarity Modulation in Vision-Language Embedding Space
- Divide, then Ground: Adapting Frame Selection to Query Types for Long-Form Video Understanding
- Make it SING: Analyzing Semantic Invariants in Classifiers
- TGSFormer: Scalable Temporal Gaussian Splatting for Embodied Semantic Scene Completion
- VLIC: Vision-Language Models As Perceptual Judges for Human-Aligned Image Compression
- Continual Distillation of Teachers from Different Domains
- Spectral Scalpel: Amplifying Adjacent Action Discrepancy via Frequency-Selective Filtering for Skeleton-Based Action Segmentation
- Pantheon360: Taming Digital Twin Generation via 3D-Aware 360° Video Diffusion
- TopoSlide: Topologically-Informed Histopathology Whole Slide Image Representation Learning
- Parameter-efficient Continual Learning for Enhancing Plasticity without Forgetting under Limited Model Capacity
- All in One: Unifying Deepfake Detection, Tampering Localization, and Source Tracing with a Robust Landmark-Identity Watermark
- HG-Lane: High-Fidelity Generation of Lane Scenes under Adverse Weather and Lighting Conditions without Re-annotation
- Real-Time Dynamic Scene Rendering with Controlled Compressibility and Contact Awareness
- Building a Precise Video Language with Human–AI Oversight
- Illumination-Consistent Human-Scene Reconstruction from Monocular Video
- Better than Average: Spatially-Aware Aggregation of Segmentation Uncertainty Improves Downstream Performance
- Bootstrapping Video Semantic Segmentation Model via Distillation-assisted Test-Time Adaptation
- Learning Effective Sign Features without Text for Gloss-free Sign Language Translation
- GraphVLM: Benchmarking Vision Language Models for Multimodal Graph Learning
- Linguistic Priors for Visual Decoupling: Towards Symmetric Vision-Brain Alignment
- ReScene4D: Temporally Consistent Semantic Instance Segmentation of Evolving Indoor 3D Scenes
- Neuro-Cognitive Reward Modeling for Human-Centered Autonomous Vehicle Control
- SPEGC: Continual Test-Time Adaptation via Semantic-Prompt-Enhanced Graph Clustering for Medical Image Segmentation
- A Provable Energy-Guided Test-Time Defense Boosting Adversarial Robustness of Large Vision-Language Models
- PointCSP: Cross-Sample Semantic Propagation and Stability Preservation in Self-Supervised Point Cloud Learning
- TRM-VLA: Temporal-Aware Chain-of-Thought Reasoning and Memorization for Vision-Language-Action Models
- GaussianDWM: 3D Gaussian Driving World Model for Unified Scene Understanding and Multi-Modal Generation
- Tokenization Allows Multimodal Large Language Models to Understand, Generate and Edit Architectural Floor Plans
- ChimeraLoRA: Multi-Head LoRA-Guided Synthetic Datasets
- InstantRetouch: Efficient and High-Fidelity Instruction-Guided Image Retouching with Bilateral Space
- Narrative Weaver: Towards Controllable Long-Range Visual Consistency with Multi-Modal Conditioning
- MV2UV: Generating High-quality UV Texture Maps with Multiview Prompts
- HulluEdit: Single-Pass Evidence-Consistent Subspace Editing for Mitigating Hallucinations in Large Vision-Language Models
- STAC: Plug-and-Play Spatio-Temporal Aware Cache Compression for Streaming 3D Reconstruction
- Semi-supervised Echocardiography Video Segmentation via Anchor Semantic Awareness and Continuous Pseudo-label Reforging
- STiTch: Semantic Transition and Transportation in Collaboration for Training-Free Zero-Shot Composed Image Retrieval
- UZ3DVG: Unaided Zero-Shot 3D Visual Grounding with Generated Language Conditions
- AdaPrior: Bayesian-Inspired Adaptive Prior Correction for Long-Tailed Continual Learning
- Hilbert-Geo: Solving Solid Geometric Problems by Neural-Symbolic Reasoning
- Anchoring and Rescaling Attention for Semantically Coherent Inbetweening
- ManifoldGD: Training-Free Hierarchical Manifold Guidance for Diffusion-Based Dataset Distillation
- Contact-Aware Neural Dynamics
- GLINT: Modeling Scene-Scale Transparency via Gaussian Radiance Transport
- The More, the Merrier: Contrastive Fusion for Higher-Order Multimodal Alignment
- Occluded Human Body Capture with Frequency Domain Denoising Prior
- EW-DETR: Evolving World Object Detection via Incremental Low-Rank DEtection TRansformer
- From Manuals to Actions: A Unified VLA Model for Chain-of-Thought Manual Generation and Robotic Manipulation
- RetimeGS: Continuous-Time Reconstruction of 4D Gaussian Splatting
- Probabilistic Discrepancy Learning for Roadside LiDAR Scene Completion
- VDOT: Efficient Unified Video Creation via Optimal Transport Distillation
- LaMoGen: Language to Motion Generation Through LLM-Guided Symbolic Inference
- PhyGaP: Physically-Grounded Gaussians with Polarization Cues
- Let Your Image Move with Your Motion! -- Implicit Multi-Object Multi-Motion Transfer
- The devil is in the details: Enhancing Video Virtual Try-On via Keyframe-Driven Details Injection
- Improving Controllable Generation: Faster Training and Better Performance via x0-Supervision
- ChronoGS: Disentangling Invariants and Changes in Multi-Period Scenes
- Your Dissimilarities Define You: Complementary Learning Exploiting Class Diversities
- R3-PCQA: Ray-Reprojection-Reinforcement for No-Reference 3D Point Cloud Quality Assessment
- Adaptive Data Augmentation with Multi-armed Bandit: Sample-Efficient Embedding Calibration for Implicit Pattern Recognition
- Multi-view Crowd Tracking Transformer with View-Ground Interactions Under Large Real-World Scenes
- UniSH: Unifying Scene and Human Reconstruction in a Feed-Forward Pass
- Will Multimodal Models Be Dazzled by Multi-Image Visual Puzzles?
- Hoi! - A Multimodal Dataset for Force-Grounded, Cross-View Articulated Manipulation
- Composite-Attribute Person Re-Identification via Pose-Guided Disentanglement
- Inconsistency-aware Multimodal Schrödinger Bridge for Deepfake Localization
- EditCtrl: Disentangled Local and Global Control for Real-Time Generative Video Editing
- Edit-aware RAW reconstruction
- OVI-MAP: Open-Vocabulary Instance-Semantic Mapping
- MaxMark: High-Capacity Diffusion-Native Watermarking via Robust and Invertible Latent Embedding
- Reinforcing Structured Chain-of-Thought for Video Understanding
- Bridging Domains through Subspace-Aware Model Merging
- Representing 3D Faces with Learnable B-Spline Volumes
- CRAFT-LoRA: Content-Style Personalization via Rank-Constrained Adaptation and Training-Free Fusion
- GaussianZoom: Progressive Zoom-in Generative 3D Gaussian Splatting with Geometric and Semantic Guidance
- Globally Optimal Pose from Orthographic Silhouettes
- PSDesigner: Automated Graphic Design with a Human-Like Creative Workflow
- Prune Wisely, Reconstruct Sharply: Compact 3D Gaussian Splatting via Adaptive Pruning and Difference-of-Gaussian Primitives
- Back to Point: Exploring Point-Language Models for Zero-Shot 3D Anomaly Detection
- Seeing Through Touch: Tactile-Driven Visual Localization of Material Regions
- Exposing Functional Fusion: A New Class of Strategic Backdoor in Dynamic Prompt Architectures
- Gau-Occ: Geometry-Completed Gaussians for Multi-Modal 3D Occupancy Prediction
- MeshRipple: Structured Autoregressive Generation of Artist-Meshes
- SAM 3D: 3Dfy Anything in Images
- E-3DPSM: A State Machine for Event-based Egocentric 3D Human Pose Estimation
- MLLM-HWSI: A Multimodal Large Language Model for Hierarchical Whole Slide Image Understanding
- Learnability-Driven Submodular Optimization for Active Roadside 3D Detection
- MimiCAT: Mimic with Correspondence-Aware Cascade-Transformer for Category-Free 3D Pose Transfer
- UDAPose: Unsupervised Domain Adaptation for Low-Light Human Pose Estimation
- LoD-Loc v3: Generalized Aerial Localization in Dense Cities using Instance Silhouette Alignment
- SCE-SLAM: Scale-Consistent Monocular SLAM via Scene Coordinate Embeddings
- CLIP Is Shortsighted: Paying Attention Beyond the First Sentence
- NoRD: A Data-Efficient Vision-Language-Action Model that Drives without Reasoning
- Eliminate Distance Differences Induced by Backdoor Attacks: Layer-Selective Training and Clipping to Mask Backdoor Models
- CoLoR: The Devil is in Scene Coordinate Regression for Large-Scale Visual Localization
- Scene-Centric Unsupervised Video Panoptic Segmentation
- CrossVL: Complexity-Aware Feature Routing and Paired Curriculum for Cross-View Vision-Language Detection
- Translating Signals to Languages for sEMG-Based Activity Recognition
- Vinedresser3D: Towards Agentic Text-guided 3D Editing
- TAMER: A Tri-Modal Contrastive Alignment and Multi-Scale Embedding Refinement Framework for Zero-Shot ECG Diagnosis
- Omni-MMSI: Toward Identity-attributed Social Interaction Understanding
- Ref4D-VideoBench: Four-Dimensional Reference-Based Evaluation of Text-to-Video Generative Models
- What Matters in Practical Learned Image Compression
- PureProof: Diffusion-Resistant Black-box Targeted Attack on Large Vision-Language Models
- The Road Less Seen: Segment Exploration for Weakly Supervised Video Anomaly Detection
- LiDAS: Lighting-driven Dynamic Active Sensing for Nighttime Perception
- PoseGAM: Robust Unseen Object Pose Estimation via Geometry-Aware Multi-View Reasoning
- Clair Obscur: an Illumination-Aware Method for Real-World Image Vectorization
- Physical Adversarial Clothing Evades Visible-Thermal Detectors via Non-Overlapping RGB-T Pattern
- An Instance-Centric Panoptic Occupancy Prediction Benchmark for Autonomous Driving
- Semantic-Adaptive Diffusion for Dynamic Spatiotemporal Fusion
- TruckDrive: Long-Range Autonomous Highway Driving Dataset
- First Frame Is the Place to Go for Video Content Customization
- SLVMEval: Synthetic Meta Evaluation Benchmark for Text-to-Long Video Generation
- SoliReward: Mitigating Susceptibility to Reward Hacking and Annotation Noise in Video Generation Reward Models
- R4-CGQA: Retrieval-based Vision Language Models for Computer Graphics Image Quality Assessment
- OMG-Avatar: One-shot Multi-LOD Gaussian Head Avatar
- WebGym: Scaling Training Environments for Long-Horizon Visual Web Agents with Realistic Tasks
- Frequency-domain Manipulation for Face Obfuscation
- Test-Time Multi-Prompt Adaptation for Open-Vocabulary Remote Sensing Image Segmentation
- HierUQ: Hierarchical Uncertainty Quantification with Adaptive Granularity Reconciliation for Degraded Image Classification
- DDiT: Dynamic Patch Scheduling for Efficient Diffusion Transformers
- D-Prism: Differentiable Primitives for Structured Dynamic Modeling
- An Optimal Transport-driven Approach for Cultivating Latent Space in Online Incremental Learning
- FACE: A Face-based Autoregressive Representation for High-Fidelity and Efficient Mesh Generation
- Local Motion Matters: A Deconstruct–Recompose Paradigm for Reinforcement Learning Pre-training from Videos
- HAD: Heterogeneity-Aware Distillation for Lifelong Heterogeneous Learning
- WildCap: Facial Albedo Capture in the Wild via Hybrid Inverse Rendering
- Mostly Text, Smart Visuals: Asymmetric Text-Visual Pruning for Large Vision-Language Models
- Learning to Refuse: Refusal-Aware Reinforcement Fine-Tuning for Hard-Irrelevant Queries in Video Temporal Grounding
- GeoTikzBridge: Advancing Multimodal Code Generation for Geometric Perception and Reasoning
- UniRefiner: Teaching Pre-trained ViTs to Self-Dispose Dross via Contrastive Register
- InterRVOS: Interaction-Aware Referring Video Object Segmentation
- Heuristic-inspired Reasoning Priors Facilitate Data-Efficient Referring Object Detection
- TrackMAE: Video Representation Learning via Track Mask and Predict
- RARE: Learn to RAnk and REtrieve for Monocular 3D Object Detection
- Enhancing Mixture-of-Experts Specialization via Cluster-Aware Upcycling
- VA-π: Variational Policy Alignment for Pixel-Aware Autoregressive Generation
- MotionEdit: Benchmarking and Learning Motion-Centric Image Editing
- Stable Mean Flow: Lyapunov-Inspired One-Step Flow Matching
- A³: Towards Advertising Aesthetic Assessment
- PPISP: Physically-Plausible Compensation and Control of Photometric Variations in Radiance Field Reconstruction
- P2GS: Physical Prior-guided Gaussian Splatting for Photometrically Consistent Urban Reconstruction
- Simple Agents Outperform Experts in Biomedical Imaging Workflow Optimization
- D2FANet: Enhancing Video Object Detection with Dual-Domain Feature Aggregation Network
- META: Meta Evolution of Tool Trajectory Adaptation for Long-Video Understanding
- Visual-RRT: Finding Paths toward Visual-Goals via Differentiable Rendering
- SPDMark: Selective Parameter Displacement for Robust Video Watermarking
- ResiHMR: Residual-Limb Aware Single-Image 3D Human Mesh Recovery for Individuals with Limb Loss
- CARI4D: Category Agnostic 4D Reconstruction of Human-Object Interaction
- OVSegDT: Segmenting Transformer for Open-Vocabulary Object Goal Navigation
- 4D Primitive-Mâché: Glueing Primitives for Persistent 4D Scene Reconstruction
- STAvatar: Soft Binding and Temporal Density Control for Monocular 3D Head Avatars Reconstruction
- MotionScale: Reconstructing Appearance, Geometry, and Motion of Dynamic Scenes with Scalable 4D Gaussian Splatting
- Radiance Meshes for Volumetric Reconstruction
- Memory-Efficient Fine-Tuning Diffusion Transformers via Dynamic Patch Sampling and Block Skipping
- MambaLiteUNet: Cross-Gated Adaptive Feature Fusion for Robust Skin Lesion Segmentation
- When Local Rules Create Global Order: Self-Organized Representation Learning for Latent Diffusion Models
- E^2-SCI: Elastic Edge–Cloud Speculative Decoding via Credit Inertia
- MoReGen: Multi-Agent Motion-Reasoning Engine for Code-based Text-to-Video Synthesis
- RecoverMark: Robust Watermarking for Localization and Recovery of Manipulated Faces
- Affostruction: 3D Affordance Grounding with Generative Reconstruction
- Variational Graph-based Normal Integration
- Affine Perspective-Three-Point Problem
- Breaking Multimodal LLM Safety via Video-Driven Prompting
- Energy-GS: Image Energy-guided Pose Alignment Gaussian Splatting with redesigned pose gradient flow
- PRIMU: Uncertainty Estimation for Novel Views in Gaussian Splatting from Primitive-Based Representations of Error and Coverage
- SGI: Structured 2D Gaussians for Efficient and Compact Large Image Representation
- DFD-HR: Generalizable Deepfake Detection via Hierarchical Routing Learning
- DRM: Diffusion-based Reward Model With Step-wise Guidance
- Goal-Driven Reward by Video Diffusion Models for Reinforcement Learning
- Rethinking Occlusion Modeling for UAV Tracking
- Bias Is a Subspace, Not a Coordinate: A Geometric Rethinking of Post-hoc Debiasing in Vision-Language Models
- TiViBench: Benchmarking Think-in-Video Reasoning for Video Generation
- Enhancing Accuracy of Uncertainty Estimation in Appearance-based Gaze Tracking with Probabilistic Evaluation and Calibration
- HamiPose: Hamiltonian Optimization for Unsupervised Domain Adaptive Pose Estimation
- E3AD: An Emotion-Aware Vision-Language-Action Model for Human-Centric End-to-End Autonomous Driving
- MuKV: Multi-Grained KV Cache Compression for Long Streaming Video Question-Answering
- Face2Scene: Using Facial Degradation as an Oracle for Diffusion-Based Scene Restoration
- Align Once to Explain: Feature Alignment for Scalable B-cosification of Foundational Vision Transformers
- Scaling Dense Event-Stream Pretraining from Visual Foundation Models
- S2D: Sparse to Dense Lifting for 3D Reconstruction with Minimal Inputs
- Towards Stable Self-Supervised Object Representations in Unconstrained Egocentric Video
- RetouchIQ: MLLM Agents for Instruction-Based Image Retouching with Generalist Reward
- Back to Source: Open-Set Continual Test-Time Adaptation via Domain Compensation
- COPE: Consistent Occlusion and Prompt Enhancement Network for Occluded Person Re-identification
- DiffSoup: Direct Differentiable Rasterization of Triangle Soup for Extreme Radiance Field Simplification
- SceneScribe-1M: A Large-Scale Video Dataset with Comprehensive Geometric and Semantic Annotations
- Artiverse: A Diverse and Physically Grounded Dataset for Articulated Objects
- OpenVO: Open-World Visual Odometry with Temporal Dynamics Awareness
- SmokeSVD: Smoke Reconstruction from A Single View via Progressive Novel View Synthesis and Refinement with Diffusion Models
- Learning to See Through a Baby’s Eyes: Early Visual Diets Enable Robust Visual Intelligence in Humans and Machines
- SplatSuRe: Selective Super-Resolution for Multi-view Consistent 3D Gaussian Splatting
- Towards Sparse Video Understanding and Reasoning
- Diverse Video Generation with Determinantal Point Process-Guided Policy Optimization
- Omni-3DEdit: Generalized Versatile 3D Editing in One-Pass
- Yume1.5: A Text-Controlled Interactive World Generation Model
- MV-RoMa: From Pairwise Matching into Multi-View Track Reconstruction
- DuetSVG: Unified Multimodal SVG Generation with Internal Visual Guidance
- Anti-Degradation Lifelong Multi-View Clustering
- Learning to See and Act: Task-Aware Virtual View Exploration for Robotic Manipulation
- Scaling Self-Supervised and Cross-Modal Pretraining for Volumetric CT Transformers
- Common Inpainted Objects In-N-Out of Context
- SEASON: Mitigating Temporal Hallucination in Video Large Language Models via Self-Diagnostic Contrastive Decoding
- EmoTaG: Emotion-Aware Talking Head Synthesis on Gaussian Splatting with Few-Shot Personalization
- Multi-Crit: Benchmarking Multimodal Judges on Pluralistic Criteria-Following
- Erasing Thousands of Concepts: Towards Scalable and Practical Concept Erasure for Text-to-Image Diffusion Models
- Hier-COS: Making Deep Features Hierarchy-aware via Composition of Orthogonal Subspaces
- LAMP: Language-Assisted Motion Planning for Controllable Video Generation
- Lynx: Towards High-Fidelity Personalized Video Generation
- Phrase-Grounding-Aware Supervised Fine-Tuning for Chart Recognition via Side-Masked Attention
- U-Mind: A Unified Framework for Real-Time Multimodal Interaction with Audiovisual Generation
- Towards Generalized Multimodal Homography Estimation
- Evo-1: Lightweight Vision-Language-Action Model with Preserved Semantic Alignment
- EfficientMonoHair: Fast Strand-Level Reconstruction from Monocular Video via Multi-View Direction Fusion
- MAMMA: Markerless Accurate Multi-person Motion Acquisition
- TerraSeg: Self-Supervised Ground Segmentation for Any LiDAR
- Physically Inspired Gaussian Splatting for HDR Novel View Synthesis
- SCAPO: Self-Supervised Category-Level Articulated Pose Estimation from a Single 3D Observation
- Percept-WAM: Perception-Enhanced World-Awareness-Action Model for Robust End-to-End Autonomous Driving
- BinaryAttention: One-Bit QK-Attention for Vision and Diffusion Transformers
- OneOcc: Semantic Occupancy Prediction for Legged Robots with a Single Panoramic Camera
- Optical Diffraction-based Convolution for Semiconductor Lithography
- Revisiting Optimal Coding for I-ToF under Practical Sensor Constraints
- Look Before You Fuse: 2D-Guided Cross-Modal Alignment for Robust 3D Detection
- What Is the Optimal Ranking Score Between Precision and Recall? We Can Always Find It and It Is Rarely F1
- Prompt-Free Universal Region Proposal Network
- NanoSD: Edge Efficient Foundation Model for Real Time Image Restoration
- SpaceTimePilot: Generative Rendering of Dynamic Scenes Across Space and Time
- POUR: A Provably Optimal Method for Unlearning Representation via Neural Collapse
- Omni-Attribute: Open-vocabulary Attribute Encoder for Visual Concept Personalization
- TokenHand: Discrete Token Representation for Efficient Hand Mesh Reconstruction
- BiOTPrompt: Bidirectional Optimal Transport Guided Prompting for Disease Evolution-aware Radiology Report Generation
- Consensus Entropy: Harnessing Multi-VLM Agreement for Self-Verifying and Self-Improving OCR
- Efficiently Reconstructing Dynamic Scenes One D4RT at a Time
- SOUPLE: Enhancing Audio-Visual Localization and Segmentation with Learnable Prompt Contexts
- Protego: User-Centric Pose-Invariant Privacy Protection Against Face Recognition-Induced Digital Footprint Exposure
- Ego: Embedding-Guided Personalization of Vision-Language Models
- Learning What Helps: Task-Aligned Context Selection for Vision Tasks
- MR. Illuminate: Zero-Shot Low-Light Image Enhancement with Diffusion Prior
- ParallelVLM: Lossless Video-LLM Acceleration with Visual Alignment Aware Parallel Speculative Decoding
- APPO: Attention-guided Perception Policy Optimization for Video Reasoning
- What Are You Doing? A Closer Look at Controllable Human Video Generation
- Fast-FoundationStereo: Real-Time Zero-Shot Stereo Matching
- Same Content, Different Answers: Cross-Modal Inconsistency in MLLMs
- Towards High-Quality Image Segmentation: Improving Topology Accuracy by Penalizing Neighbor Pixels
- UAST: Unified Active Search and Tracking for Arbitrary Targets with UAVs
- Enhancing Hands in 3D Whole-Body Pose Estimation with Conditional Hands Modulator
- GeoFree-CoSeg: Unsupervised Point Cloud-Image Cross-Modal Co-Segmentation Without Geometric Alignment
- ID-Sim: An Identity-Focused Similarity Metric
- SparseWorld-TC: Trajectory-Conditioned Sparse Occupancy World Model
- DiffusionFF: A Diffusion-based Framework for Joint Face Forgery Detection and Fine-Grained Artifact Localization
- GUIDE: A Benchmark for Understanding and Assisting Users in Open-Ended GUI Tasks
- PGA: Prior-free Generative Attack for Practical No-box Scenario
- Lipschitz Optimization for Formal Verification of Homographies
- Cross-Scale Pansharpening via ScaleFormer and the PanScale Benchmark
- DyaDiT: A Multi-Modal Diffusion Transformer for Socially Favorable Dyadic Gesture Generation
- Visual Grounding for Object Questions
- OnlineHMR: Video-based Online World-Grounded Human Mesh Recovery
- ERMoE: Eigen-Reparameterized Mixture-of-Experts for Stable Routing and Interpretable Specialization
- MASQuant: Modality-Aware Smoothing Quantization for Multimodal Large Language Models
- Dual-Prototype-Guided Multi-task Learning for Unsupervised Anomaly Detection and Classification
- Back to the Feature: Explaining Video Classifiers with Video Counterfactual Explanations
- PROMPTMINER: Black-Box Prompt Stealing against Text-to-Image Generative Models via Reinforcement Learning and VLM-Guided Optimization
- UniRain: Unified Image Deraining with RAG-based Dataset Distillation and Multi-objective Reweighted Optimization
- Language Does Matter for Cross-Domain Few-Shot Visual Feature Enhancement
- HTNav: A Hybrid Navigation Framework with Tiered Structure for Urban Aerial Vision-and-Language Navigation
- ShowUI-π: Flow-based Generative Models as GUI Dexterous Hands
- CARE-Edit: Condition-Aware Routing of Experts for Contextual Image Editing
- SAR2Net: Learning Spatially Anchored Representations for Retrieval-Guided Cross-Stain Alignment
- AVATAR: Reinforcement Learning to See, Hear, and Reason Over Video
- Beyond Perceptual Shortcuts: Causal-Inspired Debiasing Optimization for Generalizable Video Reasoning in Lightweight MLLMs
- OSPO: Object-Centric Self-Improving Preference Optimization for Text-to-Image Generation
- BridgeEQA: Virtual Embodied Agents for Real Bridge Inspections
- HypeVPR: Exploring Hyperbolic Space for Perspective to Equirectangular Visual Place Recognition
- Z-Order Transformer for Feed-Forward Gaussian Splatting
- UST-Hand: An Uncertainty-aware Spatiotemporal Point Cloud Interaction Network for 3D Self-supervised Hand Pose Estimation
- Style-GRPO: Semantic-Aware Preference Optimization for Image Style Transfer Guided by Reward Modeling
- Do Vision-Language Models Leak What They Learn? Adaptive Token-Weighted Model Inversion Attacks
- Paparazzo: Active Mapping of Moving 3D Objects
- PHANTOM: Physics-Infused Video Generation via Joint Modeling of Visual and Latent Physical Dynamics
- Robust Remote Sensing Image–Text Retrieval with Noisy Correspondence
- OrionEdit: Bridging Reference and Source Images for Generalized Cross-Image Editing
- Neural Field-Based 3D Surface Reconstruction of Microstructures from Multi-Detector Signals in Scanning Electron Microscopy
- Multigrain-aware Semantic Prototype Scanning and Tri-Token Prompt Learning Embraced High-Order RWKV for Pan-Sharpening
- CADFS: A Big CAD Program Dataset and Framework for Computer-Aided Design with Large Language Models
- ACPV-Net: All-Class Polygonal Vectorization for Seamless Vector Map Generation from Aerial Imagery
- GR-Gauge: Cost-efficient Training Configuration By Gauging the Gradient Redundancy
- Disco-GS: Gaussian Splatting in Dynamic Color Lighting
- ReAG: Reasoning-Augmented Generation for Knowledge-based Visual Question Answering
- From Few-way to Many-way: Rethinking Few-shot Fine-grained Image Classification
- DETACH: Decomposed Spatio-Temporal Alignment for Exocentric Video and Ambient Sensors with Staged Learning
- Accelerating Diffusion via Hybrid Data-Pipeline Parallelism Based on Conditional Guidance Scheduling
- Residual Primitive Fitting of 3D Shapes with SuperFrusta
- NeuroFlow: Toward Unified Visual Encoding and Decoding from Neural Activity
- Franca: Nested Matryoshka Clustering for Scalable Visual Representation Learning
- Generalizable Radio-Frequency Radiance Fields for Spatial Spectrum Synthesis
- Leveraging Multispectral Sensors for Color Correction in Mobile Cameras
- LVLM-Aided Alignment of Task-Specific Vision Models
- ClusterMark: Towards Robust Watermarking for Autoregressive Image Generators with Visual Token Clustering
- BiMotion: B-spline Motion for Text-guided Dynamic 3D Character Generation
- FlashLips: 100-FPS Mask-Free Latent Lip-Sync using Reconstruction Instead of Diffusion or GANs
- Rectifying Latent Space for Generative Single-Image Reflection Removal
- AVION: Aerial Vision–Language Instruction from Offline Teacher to Prompt-Tuned Network
- CycleBEV: Regularizing View Transformation Networks via View Cycle Consistency for Bird’s-Eye-View Semantic Segmentation
- Context-Nav: Context-Driven Exploration and Viewpoint-Aware 3D Spatial Reasoning for Instance Navigation
- Real-World Point Tracking with Verifier-Guided Pseudo-Labeling
- PaNDaS: Learnable Shape Interpolation Modeling with Localized Control
- Selection-as-Nonlinearity: Bridging Attention and Activation via a Joint Game–Decision Lens for Interpretable, Discriminative Visual Representations
- CIGMA: Causal Information-Gain Mechanistic Attribution of Attention Heads in Vision Transformers
- G^2VLM: Geometry Grounded Vision Language Model with Unified 3D Reconstruction and Spatial Reasoning
- Beyond Success: Refining Elegant Robot Manipulation from Mixed-Quality Data via Just-in-Time Intervention
- CGU-Bayes: Causal Graph Uncertainty-Guided Bayesian Inference for Domain Generalization
- Towards Open-Vocabulary Industrial Defect Understanding with a Large-Scale Multimodal Dataset
- Cross-Hand Latent Representation for Vision-Language-Action Models
- StreamAvatar: Streaming Diffusion Models for Real-Time Interactive Human Avatars
- Geometry-Guided 3D Visual Token Pruning for Video-Language Models
- Multimodal Continual Instruction Tuning with Dynamic Gradient Guidance
- MuViT: Multi-Resolution Vision Transformers for Learning Across Scales in Microscopy
- TANGO: Learning Distribution-wise Foundation Prior Consistency and Instance-wise Style Calibration for Medical Image Generalization
- Interpretable Motion-Attentive Maps: Spatio-Temporally Localizing Concepts in Video Diffusion Transformers
- FUSER: Feed-Forward Multiview 3D Registration Transformer and SE(3)^N Diffusion Refinement
- Dynamic Momentum Recalibration in Online Gradient Learning
- Learning to Reason in 4D: Dynamic Spatial Understanding for Vision Language Models
- COT-FM: Cluster-wise Optimal Transport Flow Matching
- ForceVLA2: Unleashing Hybrid Force-Position Control with Force Awareness for Contact-Rich Manipulation
- VideoNet: A Large-Scale Dataset for Domain-Specific Action Recognition
- PTC-Depth: Pose-Refined Monocular Depth Estimation with Temporal Consistency
- Rewis3d: Reconstruction Improves Weakly-Supervised Semantic Segmentation
- PET-DINO: Unifying Visual Cues into Grounding DINO with Prompt-Enriched Training
- Aesthetic Camera Viewpoint Suggestion with 3D Aesthetic Field
- Attention-aware Inference Optimizations for Large Vision-Language Models with Memory-efficient Decoding
- UniCorrn: Unified Correspondence Transformer Across 2D and 3D
- Keep it SymPL: Symbolic Projective Layout for Allocentric Spatial Reasoning in Vision-Language Models
- Rethinking Concept Bottleneck Models: From Pitfalls to Solutions
- Residual Diffusion Bridge Model for Image Restoration
- Hearing the Room Through the Shape of the Drum: Modal-Guided Sound Recovery from Multi-Point Surface Vibrations
- Bridging Facial Understanding and Animation via Language Models
- Ghosts in the Point Clouds: De-glaring LiDAR in the Transient Domain
- Pluggable Pruning with Contiguous Layer Distillation for Diffusion Transformers
- Ego-1K – A Large-Scale Multiview Video Dataset for Egocentric Vision
- RAID: Retrieval-Augmented Anomaly Detection
- Confidence-Guided Multi-Scale Aggregation for Sparse-View High-Resolution 3D Gaussian Splatting
- Predicting Spatial Transcriptomics from Histology Images via High-Order Multi-Cell Interaction Modeling
- FlexAvatar: Learning Complete 3D Head Avatars with Partial Supervision
- TROPHIES: Temporal Reconstruction of Places, Humans, and Cameras from Multi-view Videos
- Any2Any 3D Diffusion Models with Knowledge Transfer: A Radiotherapy Planning Study
- Dynamic Exposure Burst Image Restoration
- Layered 4D-Rotor Gaussian Splatting: A Compressed Representation for Long Dynamic Scenes
- Co-Me: Confidence Guided Token Merging for Visual Geometric Transformers
- M^3KG-RAG: Multi-hop Multimodal Knowledge Graph-enhanced Retrieval-Augmented Generation
- Any4D: Unified Feed-Forward Metric 4D Reconstruction
- DUO-VSR: Dual-Stream Distillation for One-Step Video Super-Resolution
- Kaleidoscopic Scintillation Event Imaging
- Self-Consistency for LLM-Based Motion Trajectory Generation and Verification
- CycleManip: Enabling Cycle-based Manipulation via Effective History Perception and Understanding
- CARD: Correlation Aware Restoration with Diffusion
- Event6D: Event-based Novel Object 6D Pose Tracking
- Resolving Evidence Sparsity: Agentic Context Engineering for Long-Document Understanding
- SurgCoT: Advancing Spatiotemporal Reasoning in Surgical Videos through a Chain-of-Thought Benchmark
- OpenMarcie: Dataset for Multimodal Action Recognition in Industrial Environments
- YOLO-ULM: Ultra-Lightweight Models for Real-Time Object Detection
- Node-RF: Learning Generalized Continuous Space-Time Scene Dynamics with Neural ODE-based NeRFs
- HumanNOVA: Photorealistic, Universal and Rapid 3D Human Avatar Modeling from a Single Image
- A2GC: Asymmetric Aggregation with Geometric Constraints for Locally Aggregated Descriptors
- Learning complete and explainable visual representations from itemized text supervision
- Foundry: Distilling 3D Foundation Models for the Edge
- UniSER: A Foundation Model for Unified Soft Effects Removal
- Condensed Test-Time Adaptation of VLMs for Action Recognition
- GraspGen-X: Cross-Embodiment 6-DOF Diffusion-based Grasping
- L^2DGS: Low-Light Dynamic Gaussian Splatting
- Grid Distillation: Compositional Image Distillation via Structured Generative Grids
- ExpPortrait: Expressive Portrait Generation via Personalized Representation
- Delta Rectified Flow Sampling for Text-to-Image Editing
- When Transformers Meet Mamba: A Hybrid Transformer-Mamba Network for Video Object Detection
- Do You See What I Am Pointing At? Gesture-Based Egocentric Video Question Answering
- PhysGS: Bayesian-Inferred Gaussian Splatting for Physical Property Estimation
- Faster-GS: Analyzing and Improving Gaussian Splatting Optimization
- Goal Force: Teaching Video Models To Accomplish Physics-Conditioned Goals
- Eliciting Complex Spatial Reasoning in MLLMs through Wide-Baseline Matching
- AssemblyBench: Physics-Aware Assembly of Complex Industrial Objects
- Concept-Guided Fine-Tuning: Steering ViTs away from Spurious Correlations to Improve Robustness
- Best Segmentation Buddies for Image-Shape Correspondence
- FlexAvatar: Flexible Large Reconstruction Model for Animatable Gaussian Head Avatars with Detailed Deformation
- Unlocking Positive Transfer in Incrementally Learning Surgical Instruments: A Self-reflection Hierarchical Prompt Framework
- Mesh4D: 4D Mesh Reconstruction and Tracking from Monocular Video
- CSF: Black-box Fingerprinting via Compositional Semantics for Text-to-Image Models
- LightSplat: Fast and Memory-Efficient Open-Vocabulary 3D Scene Understanding in Five Seconds
- Enhancing Continual Learning of Vision-Language Models via Dynamic Prefix Weighting
- Parameter-Efficient Semantic Augmentation for Enhancing Open-Vocabulary Object Detection
- ReAlign: Generalizable Image Forgery Detection via Reasoning-Aligned Representation
- Beyond Caption-Based Queries in Video Moment Retrieval
- An Empirical Study on How Video-LLMs Answer Video Questions
- MoD-DPO: Towards Mitigating Cross-modal Hallucinations in Omni LLMs using Modality Decoupled Preference Optimization
- Efficient Frame Selection for Long Video Understanding via Reinforcement Learning
- Where Culture Fades: Revealing the Cultural Gap in Text-to-Image Generation
- Mitigating Objectness Bias and Region-to-Text Misalignment for Open-Vocabulary Panoptic Segmentation
- HUMAPS-4D: A Multimodal Dataset for HUman Motion Analysis with Physiological and Semantic informations
- From Inpainting to Layer Decomposition: Repurposing Generative Inpainting Models for Image Layer Decomposition
- SIMSPINE: A Biomechanics-Aware Simulation Framework for 3D Spine Motion Annotation and Benchmarking
- Linear Image Generation by Synthesizing Exposure Brackets
- SGAD-SLAM: Splatting Gaussians at Adjusted Depth for Better Radiance Fields in RGBD SLAM
- CD-Buffer: Complementary Dual-Buffer Framework for Test-Time Adaptation in Adverse Weather Object Detection
- VSRELL: A Simple Baseline for Video Super-Resolution and Enhancement in Low-Light Environment
- Dual-Level Confidence based Implicit Self-Refinement for Medical Visual Question Answering
- Towards Robust Vision Transformers: Path Dependency Analysis and a Simple Two-Stage Adversarial Training
- FAPE-IR: Frequency-Aware Planning and Execution Framework for All-in-One Image Restoration
- M3DocDep: Multi-modal, Multi-page, Multi-document Dependency Chunking with Large Vision-Language Models
- VGGT-360: Geometry-Consistent Zero-Shot Panoramic Depth Estimation
- WPT: World-to-Policy Transfer via Online World Model Distillation
- Haptic Neural Fields: Bringing Tactile Interactions to 3D Rendered Scenes
- PrivSynth: Alternating and Control-Based Optimization for Privacy and Utility in Synthetic Data
- From Exploration to Exploitation: A Two-Stage Entropy RLVR Approach for Noise-Tolerant MLLM Training
- SIGMA: A Physics-Based Benchmark for Gas Chimney Understanding in Seismic Images
- LagerNVS: Latent Geometry for Fully Neural Real-time Novel View Synthesis
- WISER: Wider Search, Deeper Thinking, and Adaptive Fusion for Training-Free Zero-Shot Composed Image Retrieval
- Exemplar-Free Class Incremental Learning via Preserving Class-Discriminative Structure
- Semantic Derivative Flow: Graph-Guided Diffusion for Controllable Instance Interactions
- Point4Cast: Streaming Dynamic Scene Reconstruction and Forecasting
- ORCA: Orchestrated Reasoning with Collaborative Agents for Document Visual Question Answering
- Reconstructing Spiking Neural Networks Using a Single Neuron with Autapses
- CoFiDA-M: Concept-Aware Feature Modulation for Cross-Domain Adaptation with Image-Only Inference
- Human Interaction-Aware 3D Reconstruction from a Single Image
- Contrastive Cross-Bag Augmentation for Multiple Instance Learning-based Whole Slide Image Classification
- WiTTA-Bench: Benchmarking Test-Time Adaptation for WiFi Sensing
- Video-Only ToM: Enhancing Theory of Mind in Multimodal Large Language Models
- Suppressing Non-Semantic Noise in Masked Image Modeling Representations
- Cell-Type Prototype-Informed Neural Network for Gene Expression Estimation from Pathology Images
- StableMaterials: Enhancing Diversity in Material Generation via Semi-Supervised Learning
- Generate, Analyze, and Refine: Training-Free Sound Source Localization via MLLM Meta-Reasoning
- AE2VID: Event-based Video Reconstruction via Aperture Modulation
- FloVerse: Floor Plan-Guided Multi-Modal Navigation
- Next-Scale Autoregressive Models for Text-to-Motion Generation
- PyramidalWan: On Making Pretrained Video Model Pyramidal for Efficient Inference
- Recurrent Video Masked Autoencoders
- HFR and HDR Video from Multi-Attenuated Spikes Using a Rapidly Rotating SpokeND Filter
- PA-Attack: Guiding Gray-Box Attacks on LVLM Vision Encoders with Prototypes and Attention
- Domain-Skewed Federated Learning with Feature Decoupling and Calibration
- Training-free Detection of Generated Videos via Spatial-Temporal Likelihoods
- Beyond Scanpaths: Graph-Based Gaze Simulation in Dynamic Scenes
- Synthesizing Visual Concepts as Vision-Language Programs
- CG-Floor: Centroid-Guided Diffusion for Large-Scale Floorplan Generation
- WEAVE: Unleashing and Benchmarking the In-context Interleaved Comprehension and Generation
- Meta-FC: Meta-Learning with Feature Consistency for Robust and Generalizable Watermarking
- Voxify3D: Pixel Art Meets Volumetric Rendering
- OneStory: Coherent Multi-Shot Video Generation with Adaptive Memory
- Explaining Object Detectors via Collective Contribution of Pixels
- Foundation Encoders Are All You Need for Preference-Aware Personalization
- CHIRP dataset: towards long-term, individual-level, behavioral monitoring of bird populations in the wild
- RobotSeg: A Model and Dataset for Segmenting Robots in Image and Video
- Talk2Move: Reinforcement Learning for Text-Instructed Object-Level Geometric Transformation in Scenes
- BuildAnyPoint: 3D Building Structured Abstraction from Diverse Point Clouds
- DDSF: Robust Few-Shot Learning via Disentangled Subspaces with Determinantal Point Process
- Learning What Matters: Prioritized Concept Learning via Relative Error-driven Sample Selection
- ProFocus: Proactive Perception and Focused Reasoning in Vision-and-Language Navigation
- CUE: Concept-Aware Multi-Label Expansion to Mitigate Concept Confusion in Long-Tailed Learning
- SIMPACT: Simulation-Enabled Action Planning using Vision-Language Models
- Mitigating Simplicity Bias in OOD Detection through Object Co-occurrence Analysis
- A Supervised Multi-task Framework for Joint cryo-ET Restoration Enabled by Generative Physical Simulation
- Avatar Forcing: Real-Time Interactive Head Avatar Generation for Natural Conversation
- Same or Not? Enhancing Visual Perception in Vision-Language Models
- Revisiting F-measure Optimization in Multi-Label Classification: A Sampling-based Approach
- Towards Motion Turing Test: Evaluating Human-Likeness in Humanoid Robots
- PromptEnhancer: Taming Your Rewriter for Text-to-Image Generation via Fine-Grained Reward
- FrankenMotion: Part-level Human Motion Generation and Composition
- FE2E: From Editor to Dense Geometry Estimator
- Lighting in Motion: Spatiotemporal HDR Lighting Estimation
- Aligning Text, Images and 3D Structure Token-by-Token
- Ghost-FWL: A Large-Scale Full-Waveform LiDAR Dataset for Ghost Detection and Removal
- Towards Realistic and Consistent Orbital Video Generation via 3D Foundation Priors
- Dual Graph Regularized Deep Unfolding Network for Guided Depth Map Super-resolution
- FocusUI: Efficient UI Grounding via Position-Preserving Visual Token Selection
- RewardFlow: Generate Images by Optimizing What You Reward
- AeroDGS: Physically Consistent Dynamic Gaussian Splatting for Single-Sequence Aerial 4D Reconstruction
- CogDriver: Integrating Cognitive Inertia for Temporally Coherent Planning in Autonomous Driving
- Coded-E2LF: Coded Aperture Light Field Imaging from Events
- Neural-Centric Video Processing Pipeline for Unified Multi-Task Inference
- OmniGround: A Comprehensive Spatio-Temporal Grounding Benchmark for Real-World Complex Scenarios
- Decompose, Mix, Adapt: A Unified Framework for Parameter-Efficient Neural Network Recombination and Compression
- StructXLIP: Enhancing Vision-language Models with Multimodal Structural Cues
- Reasoning Palette: Modulating Reasoning via Latent Contextualization for Controllable Exploration for (V)LMs
- First Logit Boosting: Visual Grounding Method to Mitigate Object Hallucination in Large Vision-Language Models
- UniTEX: Universal High Fidelity Generative Texturing for 3D Shapes
- Seeing Clearly, Reasoning Confidently: Plug-and-Play Remedies for Vision Language Model Blindness
- InstantViR: Real-Time Video Inverse Problem Solver with Distilled Diffusion Prior
- Block-Sparse Global Attention for Efficient Multi-View Geometry Transformers
- Scaling-Aware Data Selection for End-to-End Autonomous Driving Systems
- EV-CGNet: Co-visible Focused 3D-guided 2D Event Keypoint Detection Network
- RAVEN: Radar Adaptive Vision Encoders for Efficient Chirp-wise Object Detection and Segmentation
- OptiMVMap: Offline Vectorized Map Construction via Optimal Multi-vehicle Perspectives
- Visual Personalization Turing Test
- BulletTime: Decoupled Control of Time and Camera Pose for Video Generation
- Parallelised Differentiable Straightest Geodesics for 3D Meshes
- PersonaLive! Expressive Portrait Image Animation for Live Streaming
- CLIP-like Model as a Foundational Density Ratio Estimator
- SATTC: Structure-Aware Label-Free Test-Time Calibration for Cross-Subject EEG-to-Image Retrieval
- Guardians of the Hair: Rescuing Soft Boundaries in Depth, Stereo, and Novel Views
- Inference-time Physics Alignment of Video Generative Models with Latent World Models
- JANUS: A Lightweight Framework for Jailbreaking Text-to-Image Models via Distribution Optimization
- Gated KalmaNet: A Fading Memory Layer through Test-time Ridge Regression
- DeDelayed: Deleting Remote Inference Delay via On-Device Correction
- SparVAR: Exploring Sparsity in Visual AutoRegressive Modeling for Training-Free Acceleration
- ChartNet: A Million-Scale, High-Quality Multimodal Dataset for Robust Chart Understanding
- CARD: A Multi-Modal Automotive Dataset for Dense 3D Reconstruction in Challenging Road Topography
- Prompt-Anchored Vision–Text Distillation for Lifelong Person Re-identification
- ReFAct: Empowering Multimodal Web Agents with Visual and Context Focusing
- Towards Multimodal Domain Generalization with Few Labels
- Coverage Optimization for Camera View Selection
- FluidGaussian: Propagating Simulation-Based Uncertainty Toward Functionally-Intelligent 3D Reconstruction
- NeighborMAE: Exploiting Spatial Dependencies between Neighboring Earth Observation Images in Masked Autoencoders Pretraining
- PersonaVLM: Long-Term Personalized Multimodal LLMs
- gQIR: Generative Quanta Image Reconstruction
- SMVRT: Implicit Human 3D Modeling Using Sparse Multi-View Volumetric Reconstruction with Transformer Fusion
- NIL: No-data Imitation Learning
- QueryOcc: Query-based Self-Supervision for 3D Semantic Occupancy
- Few-for-Many Personalized Federated Learning
- A More Word-like Image Tokenization for MLLMs
- Image-Guided Geometric Stylization of 3D Meshes
- Boosting Reasoning in Large Multimodal Models via Activation Replay
- Learning from Noisy Supervision: A Denoising–Debiasing Framework for Weakly Supervised Video Anomaly Detection
- HieraMamba: Video Temporal Grounding via Hierarchical Anchor-Mamba Pooling
- No Way To Steal My Face: Proactive Defense Against Identity-Preserving Personalized Generation
- Robust Spiking Neural Networks by Temporal Mutual Information
- V^2-SAM: Marrying SAM2 with Multi-Prompt Experts for Cross-View Object Correspondence
- Keep It Frozen: Domain-Routed Conditional Residual Modulation for Multi-Domain Vision Transformers
- SceMoS: Scene-Aware 3D Human Motion Synthesis by Planning with Geometry-Grounded Tokens
- Semantic Scale Space: A Framework for Controllable Image Abstraction
- ClimaOoD: Improving Anomaly Segmentation via Physically Realistic Synthetic Data
- Geometry-Aware Cross-Modal Graph Alignment for Referring Segmentation in 3D Gaussian Splatting
- QuCNet: Quantum Deep Learning Driven Multi-Circuit Network for Remote Sensing Image Classification
- Image Generation from Contextually-Contradictory Prompts
- Text-Printed Image: Bridging the Image-Text Modality Gap for Text-centric Training of Large Vision-Language Models
- Spike-driven Discrete Aggregation for Event-based Object Detection
- Seeing Through the Shift: Causality-Inspired Robust Generalized Category Discovery
- Echoes Over Time: Unlocking Length Generalization in Video-to-Audio Generation Models
- Anomaly as Non-Conformity via Training-Free Graph Laplacian Energy Minimization
- From Selection to Scheduling: Federated Geometry-Aware Correction Makes Exemplar Replay Work Better under Continual Dynamic Heterogeneity
- VQ-VA World: Towards High-Quality Visual Question-Visual Answering
- RoMo: A Large-Scale, Richly Organized Dataset and Semantic Taxonomy for Human Motion Generation
- FMPose3D: monocular 3D pose estimation via flow matching
- Eulerian Gaussian Splatting using Hashed Probability Pyramids
- DUET-VLM: Dual stage Unified Efficient Token reduction for VLM Training and Inference
- GenBreak: Red Teaming Text-to-Image Generation Using Large Language Models
- Learning to Solve PDEs on Neural Shape Representations
- Solving Minimal Problems Without Matrix Inversion Using FFT-Based Interpolation
- MultiShotMaster: A Controllable Multi-Shot Video Generation Framework
- When Safety Collides: Resolving Multi-Category Harmful Conflicts in Text-to-Image Diffusion via Adaptive Safety Guidance
- Polyphony: Diffusion-based Dual-Hand Action Segmentation with Alternating Vision Transformer and Semantic Conditioning
- Virtual Full-stack Scanning of Brain MRI via Imputing Any Quantised Code
- Rationale-Enhanced Decoding for Multi-modal Chain-of-Thought
- Few-shot Acoustic Synthesis with Multimodal Flow Matching
- HVG-3D: Bridging Real and Simulation Domains for 3D-Conditional Hand-Object Interaction Video Synthesis
- RF4D: Neural Radar Fields for Novel View Synthesis in Outdoor Dynamic Scenes
- Sparsity as a Key: Unlocking New Insights from Latent Structures for Out-of-Distribution Detection
- Designing to Forget: Deep Semi-parametric Models for Unlearning
- UniPixie: Unified and Probabilistic 3D Physics Learning via Flow Matching
- Evolutionary Multimodal Reasoning via Hierarchical Semantic Representation for Intent Recognition
- AnchorSplat: Feed-Forward 3D Gaussian Splatting With 3D Geometric Priors
- AdaRadar: Rate Adaptive Spectral Compression for Radar-based Perception
- Push-and-Step: From RL-Based Balance Recovery to Physical Simulation of Dense Crowds
- Negative Binomial Variational Autoencoders for Overdispersed Latent Modeling
- Differentially Private 2D Human Pose Estimation
- Cross-Domain Demo-to-Code via Neurosymbolic Counterfactual Reasoning
- OpenMMReasoner: Pushing the Frontiers in Multimodal Reasoning with an Open and General Recipe
- 2-Shots in the Dark: Low-Light Denoising with Minimal Data Acquisition
- Fine-Grained Multi Image Object Hallucination Benchmark
- Adaptive Confidence Regularization for Multimodal Failure Detection
- A Multi-Agent Perception-Action Alliance for Efficient Long Video Reasoning
- KAMP: Knowledge-Anchored Multimodal Pretraining Framework for Medical Image Representation
- A Sanity Check for Multi-In-Domain Face Forgery Detection in the Real World
- AffordGrasp: Cross-Modal Diffusion for Affordance-Aware Grasp Synthesis
- VEMamba: Efficient Isotropic Reconstruction of Volume Electron Microscopy with Axial-Lateral Consistent Mamba
- EmoDiffTalk: Emotion-aware Diffusion for Editable 3D Gaussian Talking Head
- Endless World: Real-Time 3D-Aware Long Video Generation
- Volumetric Functional Maps
- TraceGen: World Modeling in 3D Trace Space Enables Learning from Cross-Embodiment Videos
- Disentangle-then-Align: Non-Iterative Hybrid Multimodal Image Registration via Cross-Scale Feature Disentanglement
- LoST: Level of Semantics Tokenization for 3D Shapes
- TokenGS: Decoupling 3D Gaussian Prediction from Pixels with Learnable Tokens
- Turbo-GS: Accelerating 3D Gaussian Fitting for High-Resolution Radiance Fields
- Gaussian Mapping for Evolving Scenes
- Copy-Transform-Paste: Zero-Shot Object-Object Alignment Guided by Vision-Language and Geometric Constraints
- GaussFusion: Improving 3D Reconstruction in the Wild with A Geometry-Informed Video Generator
- MDCS-MoAME: Multi-directional Composite Scanning with Mixture of Attention and Mamba Experts for Cancer Survival Prediction
- MV-TAP: Tracking Any Point in Multi-View Videos
- InterAgent: Physics-based Multi-agent Command Execution via Diffusion on Interaction Graphs
- Balanced Hierarchical Contrastive Learning with Decoupled Queries for Fine-grained Object Detection in Remote Sensing Images
- CCF: Complementary Collaborative Fusion for Domain Generalized Multi-Modal 3D Object Detection
- ChordEdit: One-Step Low-Energy Transport for Image Editing
- InvAD: Inversion-based Reconstruction-Free Anomaly Detection with Diffusion Models
- SeaCache: Spectral-Evolution-Aware Cache for Accelerating Diffusion Models
- GaussianVision: Vision-Language Alignment from Compressed Image Representations using 2D Gaussian Splatting
- PPM-CLIP: Probabilistic Prompt Modeling for Generalizable AI-Generated Image Detection
- Boosting Quantitive and Spatial Awareness for Zero-Shot Object Counting
- Lafite: A Generative Latent Field for 3D Native Texturing
- Point Cloud as a Foreign Language for Multi-modal Large Language Model
- Action Motifs: Self-Supervised Hierarchical Representation of Human Body Movements
- Training-free Mixed-Resolution Latent Upsampling for Spatially Accelerated Diffusion Transformers
- Training-Only Heterogeneous Image-Patch-Text Graph Supervision for Advancing Few-Shot Learning Adapters
- Learning to Identify Out-of-Distribution Objects for 3D LiDAR Anomaly Segmentation
- EgoPoseFormer v2: Accurate Egocentric Human Motion Estimation for AR/VR
- GaussianPile: A Unified Sparse Gaussian Splatting Framework for Slice-based Volumetric Reconstruction
- LAMP: Localization Aware Multi-camera People Tracking in Metric 3D World
- Distilling Balanced Knowledge from a Biased Teacher
- Reinforcement-Guided Synthetic Data Generation for Privacy-Sensitive Identity Recognition
- TextFM: Robust Semi-dense Feature Matching with Language Guidance
- Obstruction Reasoning for Robotic Grasping
- Gradient Knows Best: Mixed-Precision Quantization via Gradient-Guided Bit Allocation for Super-Resolution
- Improving Adversarial Transferability with Local Perturbation Augmentation
- VOSR: A Vision-Only Generative Model for Image Super-Resolution
- MAD: Motion Appearance Decoupling for efficient Driving World Models
- MeshMosaic: Scaling Artist Mesh Generation via Local-to-Global Assembly
- From Spots to Pixels: Dense Spatial Gene Expression Prediction from Histology Images
- Hermite Radial Basis Function for Surface Reconstruction via Differentiable Rendering
- TUNA: Taming Unified Visual Representations for Native Unified Multimodal Models
- TriDF: Evaluating Perception, Detection, and Hallucination for Interpretable DeepFake Detection
- SenCache: Accelerating Diffusion Model Inference via Sensitivity-Aware Caching
- Open-world Hand-Object Interaction Video Generation Based on Structure and Contact-aware Representation
- POINTS-Long: Adaptive Dual-Mode Visual Reasoning in MLLMs
- Unleashing the Power of Chain-of-Prediction for Monocular 3D Object Detection
- Generative Video Motion Editing with 3D Point Tracks
- CRIT: Graph-Based Automatic Data Synthesis to Enhance Cross-Modal Multi-Hop Reasoning
- RoboAgent: Chaining Basic Capabilities for Embodied Task Planning
- Less is More: Data-Efficient Adaptation for Controllable Text-to-Video Generation
- 3D Gaussian Splatting at Arbitrary Resolutions with Compact Proxy Anchors
- PAVAS: Physics-Aware Video-to-Audio Synthesis
- A Stitch in Time: Learning Procedural Workflow via Self-Supervised Plackett–Luce Ranking
- Part-aware Modeling of Articulated Objects using 3D Gaussian Splatting
- AMB3R: Accurate Feed-forward Metric-scale 3D Reconstruction with Backend
- FPSBench: A Benchmark for Video Understanding at High Frame Rates
- PMRNet: Physics-informed Multi-scale Refinement Network for Medical Image Segmentation
- Attribute-Preserving Pseudo-Labeling for Diffusion-Based Face Swapping
- DarkShake-DVS: Event-based Human Action Recognition under Low-light and Shaking Camera Conditions
- HyperGait: Unleashing the Power of Parsing for Gait Recognition in the Wild via Hypergraph
- Orthogonal Spatial-Aware Multi-View Anchor Graph Clustering for Incomplete Remote Sensing Data
- V-DPM: 4D Video Reconstruction with Dynamic Point Maps
- 3D-LATTE: Latent Space 3D Editing from Textual Instructions
- EnergyAction: Unimanual to Bimanual Composition with Energy-Based Models
- Similarity-as-Evidence: Calibrating Overconfident VLMs for Interpretable and Label-Efficient Medical Active Learning
- Rethinking Model Selection in VLM Through the Lens of Gromov-Wasserstein Distance
- Composing Concepts from Images and Videos via Concept-prompt Binding
- DMAligner: Enhancing Image Alignment via Diffusion Model Based View Synthesis
- MMCP-GEN: A Modality-Extensible Diffusion Language Model for Conditional Protein Sequence Generation
- Scalable Object Relation Encoding for Better 3D Spatial Reasoning in Large Language Models
- Self-Corrected Image Generation with Explainable Latent Rewards
- Pointer-CAD: Unifying B-Rep and Command Sequences via Pointer-based Edges & Faces Selection
- URScenes: A Multi-scenario Dataset for Unstructured Road Environments
- When CLIP Sees More, It Fights Back Harder: Multi-View Guided Adaptive Counterattacks for Test-Time Adversarial Robustness
- Modeling the Visual Ambiguity of Human Sketches
- PoseD-Flow: Versatile and Guided Flow Matching Model of Human Pose
- Registration-Free Learnable Multi-View Capture of Faces in Dense Semantic Correspondence
- SynthRGB-T: Language-Vision Guided Image Translation for Diversity Synthesis
- IBISAgent: Reinforcing Pixel-Level Visual Reasoning in MLLMs for Universal Biomedical Object Referring and Segmentation
- Precise Object and Effect Removal with Adaptive Target-Aware Attention
- AlignPose: Generalizable 6D Pose Estimation via Multi-view Feature-metric Alignment
- AdaBet: Gradient-free Layer Selection for Efficient Training of Deep Neural Networks
- Block-based Learned Image Compression without Blocking Artifacts
- MixerCSeg: An Efficient Mixer Architecture for Crack Segmentation via Decoupled Mamba Attention
- Tunable Soft Equivariance with Guarantees
- Expanding mmWave Datasets for Human Pose Estimation with Unlabeled Data and LiDAR Datasets
- REALM: An MLLM-Agent Framework for Open World 3D Reasoning Segmentation and Editing on Gaussian Splatting
- MMTIT-Bench: A Multilingual and Multi-Scenario Benchmark with Cognition–Perception–Reasoning Guided Text-Image Machine Translation
- SPE-MVS: Spatial Position Encoding Enhanced Multi-View Stereo with Monocular Depth Priors
- Factorize, Reconstruct, Enhance: A Unified Framework for Multimodal Sentiment Analysis
- PhenoYieldNet: Learning Crop-Aware Phenological Responses for Multi-Crop Yield Prediction
- Geo2: Geometry-Guided Cross-view Geo-Localization and Image Synthesis
- HBridge: H-Shape Bridging of Heterogeneous Experts for Unified Multimodal Understanding and Generation
- LiDAR-to-4DRadar Diffusion Bridge via Cross-Modal Alignment and Translation in Latent Space
- S^2FT: Parameter-Efficient Fine-Tuning in Sparse Spectrum Domain
- UniM: A Unified Any-to-Any Interleaved Multimodal Benchmark
- VENI: Variational Encoder for Natural Illumination
- RMAE-ProGRess: Advancing Semantic Segmentation in Unstructured Environments
- VisiLock: Authorizing Instruction-based Image editing with Dual Score Distillation
- SIF: Semantically In-Distribution Fingerprints for Large Vision-Language Models
- Streaming Video Crime Anticipation with Spatio-Temporal Causal Reasoning
- Charge: A Comprehensive Novel View Synthesis Benchmark and Dataset to Bind Them All
- UIKA: Fast Universal Head Avatar from Pose-Free Images
- Echoes of Ownership: Adversarial-Guided Dual Injection for Copyright Protection in MLLMs
- Temporal Equilibrium MeanFlow: Bridging the Scale Gap for One-Step Generation
- Scene Reconstruction as Mapping Priors for 3D Detection
- SPARROW: Learning Spatial Precision and Temporal Referential Consistency in Pixel-Grounded Video MLLMs
- Restore, Assess, Repeat: A Unified Framework for Iterative Image Restoration
- TokenLight: Precise Lighting Control in Images using Attribute Tokens
- EffectMaker: Unifying Reasoning and Generation for Customized Visual Effect Creation
- History to Future: Evolving Agent with Experience and Thought for Zero-shot Vision-and-Language Navigation
- MetricHMSR: Metric Human Mesh and Scene Recovery from Monocular Images
- Beyond Myopic Alignment: Lookahead Optimization for Online Class-Incremental Learning
- Weight Space Representation Learning via Neural Field Adaptation
- Towards Balanced Multi-Modal Learning in 3D Human Pose Estimation
- DSFlash: Comprehensive Panoptic Scene Graph Generation in Realtime
- ProgTrack: A Multi-Object Tracking Algorithm with Progressive Matching Strategy
- Low-Resolution Editing is All You Need for High-Resolution Editing
- Draft and Refine with Visual Experts
- Edit-As-Act: Goal-Regressive Planning for Open-Vocabulary 3D Indoor Scene Editing
- Content-Adaptive Hierarchical Hyperprior for Neural Video Coding
- Zero-Shot Reconstruction of Animatable 3D Avatars with Cloth Dynamics from a Single Image
- Decompose and Transfer: CoT-Prompting Enhanced Alignment for Open-Vocabulary Temporal Action Detection
- Mario: Multimodal Graph Reasoning with Large Language Models
- Fine-Tuning Impairs the Balancedness of Foundation Models in Long-tailed Personalized Federated Learning
- Hyperbolic Defect Feature Synthesis for Few-Shot Defect Classification
- CanonCGT: Reference-Based Color Grading via Canonical Pivot Representation
- Energy Waveify and Redistribution for Test-Time Adaptation: A Control System Perspective
- Toward Real-world Infrared Image Super-Resolution: A Unified Autoregressive Framework and Benchmark Dataset
- Large-scale Codec Avatars: The Unreasonable Effectiveness of Large-scale Avatar Pretraining
- ReMoGen: Real-time Human Interaction-to-Reaction Generation via Modular Learning from Diverse Data
- Critical Patch-Aware Sparse Prompting with Decoupled Training for Continual Learning on the Edge
- Widget2Code: From Visual Widgets to UI Code via Multimodal LLMs
- Attend Before Attention: Efficient and Scalable Video Understanding via Autoregressive Gazing
- SMRABooth: Subject and Motion Representation Alignment for Customized Video Generation
- TAPE: Task-Adaptive Prototype Evolution in Audio-Language Models for Fully Few-shot Class-incremental Audio Classification
- Representation-Steered Incremental Adapter-Tuning for Class-Incremental Learning with Pre-Trained Models
- Generative Adversarial Perturbations with Cross-paradigm Transferability on Localized Crowd Counting
- PAS: Prelim Attention Score for Detecting Object Hallucinations in Large Vision-Language Models
- Grounded Latents for Entity-Centric 4D Scene Generation
- InternVideo-Next: Towards World-Understanding Video Models
- MicroFM: Physics-guided Flow Matching for Isotropic Microscopy Reconstruction
- How Far Can We Go With Synthetic Data for Audio-Visual Sound Source Localization?
- MoCoDiff: A Controllable Autoregressive Diffusion Model for Expressive Motion Generation
- CryoKRAQEN: Kernel-Regularized Annealing for Quantized Embedding Networks in Cryo-EM Heterogeneous Reconstruction
- IAG: Input-aware Backdoor Attack on VLM-based Visual Grounding
- TWINGS: Thin Plate Splines Warp-aligned Initialization for Sparse-View Gaussian Splatting
- Spatial-SAM: Spatially Consistent 3D Electron Microscopy Segmentation with SDF Memory and Semi-Supervised Learning
- FUN REC: Reconstructing Functional 3D Scenes from Egocentric Interaction Videos
- SVAgent: Storyline-guided Long Video Understanding via Cross-Modal Multi-Agent Collaboration
- AGFT: Alignment-Guided Fine-Tuning for Zero-Shot Adversarial Robustness of Vision-Language Models
- CGHair: Compact Gaussian Hair Reconstruction with Card Clustering
- Selectively Extracting and Injecting Visual Attributes into Text-to-Image Models
- SG-LoRA: Semantic-guided LoRA Parameters Generation
- HCL-FF: Hierarchical and Contrastive Learning for Forward-Forward Algorithm
- Towards Real-World Document Parsing via Realistic Scene Synthesis and Document-Aware Training
- Spectral Super-Resolution via Adversarial Unfolding and Data-Driven Spectrum Regularization: From Multispectral Satellite Data to NASA Hyperspectral Image
- Illustrator’s Depth: Monocular Layer Index Prediction for Image Decomposition
- PAF: Perturbation-Aware Filtering for Open-Set Semi-Supervised Learning
- HTTM: Head-wise Temporal Token Merging for Faster VGGT
- TopoMesh: High-Fidelity Mesh Autoencoding via Topological Unification
- Self-Diffusion Driven Blind Imaging
- TRANSPORTER: Transferring Visual Semantics from VLM Manifolds
- SubspaceAD: Training-Free Few-Shot Anomaly Detection via Subspace Modeling
- StoryTailor: A Zero-Shot Pipeline for Action-Rich Multi-Subject Visual Narratives
- Parse, Search, and Confirmation: Training-Free Aerial Vision-and-Dialog Navigation with Chain-of-Thought Reasoning and Structured Spatial Memory
- Wave-Former: Through-Occlusion 3D Reconstruction via Wireless Shape Completion
- FlashMesh: Faster and Better Autoregressive Mesh Synthesis via Structured Speculation
- AcTTA: Rethinking Test-Time Adaptation via Dynamic Activation
- VarSplat: Uncertainty-aware 3D Gaussian Splatting for Robust RGB-D SLAM
- ReasonEdit: Towards Reasoning-Enhanced Image Editing Models
- Stitch-a-Demo: Creating Video Demonstrations from Multistep Descriptions
- VecGlypher: Unified Vector Glyph Generation with Language Models
- Defending Unauthorized Model Merging via Dual-Stage Weight Protection
- SVHalluc: Benchmarking Speech–Vision Hallucination in Audio-Visual Large Language Models
- TRIDENT: A Trimodal Cascade Generative Framework for Drug and RNA-Conditioned Cellular Morphology Synthesis
- SPAR: Single-Pass Any-Resolution ViT for Open-vocabulary Segmentation
- A Closer Look at Cross-Domain Few-Shot Object Detection: Fine-Tuning Matters and Parallel Decoder Helps
- Soft Modality-Guided Expert Specialization in MoE-VLMs
- AdaDexTrack: Dynamic Modulation for Adaptive and Generalizable Dexterous Manipulation Tracking
- SoccerMaster: A Vision Foundation Model for Soccer Understanding
- SafeDrive: Fine-Grained Safety Reasoning for End-to-End Driving in a Sparse World
- Perceptual 3D Simulation With Physical World Modeling
- Cleaning the Pool: Progressive Filtering of Unlabeled Pools in Deep Active Learning
- Random Wins All: Rethinking Grouping Strategies for Vision Tokens
- Learning to Adapt: Self-Improving Web Agent via Cognitive-Aware Exploration
- MuM: Multi-View Masked Image Modeling for 3D Vision
- WorldMM: Dynamic Multimodal Memory Agent for Long Video Reasoning
- Unsupervised Multi-Scale Segmentation of 3D Subcellular World with Stable Diffusion Foundation Model
- 3D Gaussian Splatting with Self-Constrained Priors for High Fidelity Surface Reconstruction
- Portable Active Learning for Object Detection
- Generalized and Personalized Federated Learning with Black-Box Foundation Models via Orthogonal Transformations
- Depth Peeling for High-Fidelity Gaussian-Enhanced Surfel Rendering
- CCCaption: Dual-Reward Reinforcement Learning for Complete and Correct Image Captioning
- LLaDA-MedV: Exploring Large Language Diffusion Models for Biomedical Image Understanding
- SwiftTailor: Efficient 3D Garment Generation with Geometry Image Representation
- Perceiving the Near, Reasoning the Distant: Coherent Long-Horizon Trajectory Prediction for Autonomous Driving
- TGT: Text-Grounded Trajectories for Locally Controlled Video Generation
- Saliency-R1: Enforcing Interpretable and Faithful Vision-language Reasoning via Saliency-map Alignment Reward
- ConceptPose: Training-Free Zero-Shot Object Pose Estimation using Concept Vectors
- Prefill-Time Intervention for Mitigating Hallucination in Large Vision-Language Models
- Focus-to-Perceive Representation Learning: A Cognition-Inspired Hierarchical Framework for Endoscopic Video Analysis
- EmbodiedSplat: Online Feed-Forward Semantic 3DGS for Open-Vocabulary 3D Scene Understanding
- Skullptor: High Fidelity 3D Head Reconstruction in Seconds with Multi-View Normal Prediction
- Revisiting Pose Sensitivity in Splat-based Computed Tomography under Sparse-view Reconstruction
- DASH: A Meta-Attack Framework for Synthesizing Effective and Stealthy Adversarial Examples
- Relightful Video Portrait Harmonization
- Generalizable Knowledge Distillation from Vision Foundation Models for Semantic Segmentation
- ProcessMaker: A Generalized Process Visualization Framework with Adaptive Sequence Steps on Diffusion Transformers
- Hierarchical Point-Patch Fusion with Adaptive Patch Codebook for 3D Shape Anomaly Detection
- Prototypical Action Reasoning Facilitated by Vision-Language Alignment for Egocentric Action Anticipation
- DARC: Dual Adjustment Reasoning with Counterfactuals for Trustworthy Chest X-ray Classification
- Towards Robust Multimodal Large Language Models Against Jailbreak Attacks
- Intrinsic Image Fusion for Multi-View 3D Material Reconstruction
- Learning 3D Shape Fidelity Metric from Real-world Distortions
- Aligning What Vision-Language Models See and Perceive with Adaptive Information Flow
- Efficient Hybrid SE(3)-Equivariant Visuomotor Flow Policy via Spherical Harmonics for Robot Manipulation
- 3D-IDE: 3D Implicit Depth Emergent
- IGen: Scalable Data Generation for Robot Learning from Open-World Images
- Stake the Points: Structure-Faithful Instance Unlearning
- Towards GUI Agents: Vision-Language Diffusion Models for GUI Grounding
- DriveVLN: Towards Mapless Vision-and-Language Navigation in Autonomous Driving
- HoneyBee: Data Recipes for Vision-Language Reasoners
- VGGT-Ω
- LoFA: Learning to Predict Personalized Prior for Fast Adaptation of Visual Generative Models
- Fed-ADE: Adaptive Learning Rate for Federated Post-adaptation under Distribution Shift
- V-RGBX: Video Editing with Accurate Controls over Intrinsic Properties
- Hidden Monotonicity: Explaining Deep Neural Networks via their DC Decomposition
- RefAV: Towards Planning-Centric Scenario Mining
- Test-Time Training for LiDAR Semantic Segmentation under Corruption via Geometric Inlier Discrimination
- MaskDiME: Adaptive Masked Diffusion for Precise and Efficient Visual Counterfactual Explanations
- Granulon: Awakening Pixel-Level Visual Encoders with Adaptive Multi-Granularity Semantics for MLLM
- FluoCLIP: Stain-Aware Focus Quality Assessment in Fluorescence Microscopy
- FAST: Topology-Aware Frequency-Domain Distribution Matching for Coreset Selection
- ELVIS: Enhance Low-Light for Video Instance Segmentation in the Dark
- PE3R: Perception-Efficient 3D Reconstruction
- LeapAlign: Post-training Flow Matching Models at Any Generation Step by Building Two-Step Trajectories
- Improving Calibration in Test-Time Prompt Tuning for Vision-Language Models via Data-Free Flatness-Aware Prompt Pretraining
- Unposed-to-3D: Learning Simulation-Ready Vehicles from Real-World Images
- Multi-speaker Attention Alignment for Multimodal Social Interaction
- PackUV: Packed Gaussian UV Maps for 4D Volumetric Video
- MEMO: Human-like Crisp Edge Detection Using Masked Edge Prediction
- Continuous Exposure-Time Modeling for Realistic Atmospheric Turbulence Synthesis
- GIFSplat: Generative Prior-Guided Iterative Feed-Forward 3D Gaussian Splatting from Sparse Views
- ShadowDraw: From Any Object to Shadow-Drawing Compositional Art
- Geometric-Photometric Event-based 3D Gaussian Ray Tracing
- FRAMER: Frequency-Aligned Self-Distillation with Adaptive Modulation Leveraging Diffusion Priors for Real-World Image Super-Resolution
- Rethinking Pose Refinement in 3D Gaussian Splatting under Pose Prior and Geometric Uncertainty
- Quant Experts: Token-aware Adaptive Error Reconstruction with Mixture of Experts for Large Vision-Language Models Quantization
- Cycle-Consistent Tuning for Layered Image Decomposition
- INSID3: Training-Free In-Context Segmentation with DINOv3
- MorphSeek: Fine-grained Latent Representation-Level Policy Optimization for Deformable Image Registration
- Residual Decoding: Mitigating Hallucinations in Large Vision-Language Models via History-Aware Residual Guidance
- GraPHFormer: A Multimodal Graph Persistent Homology Transformer for the Analysis of Neuroscience Morphologies
- MatSpray: Fusing 2D Material World Knowledge on 3D Geometry
- Dual Ascent Diffusion for Inverse Problems
- DAGE: Dual-Stream Architecture for Efficient and Fine-Grained Geometry Estimation
- TriSim: Tri-Dimensional Similarity Modeling with Extreme Value Theory for False-Negative Mitigation in Remote Sensing Image-Text Retrieval
- The Geometry of Robustness: Optimizing Loss Landscape Curvature and Feature Manifold Alignment for Robust Finetuning of Vision-Language Models
- Beyond Geometry: Artistic Disparity Synthesis for Immersive 2D-to-3D
- PAD-Hand: Physics-Aware Diffusion for Hand Motion Recovery
- PaQ-DETR: Learning Pattern and Quality-Aware Dynamic Queries for Object Detection
- PointThinker: Point-Incentivized Parallel Thinking for Multimodal Large Language Model
- Particulate: Feed-Forward 3D Object Articulation
- Learning to See through Illumination Extremes with Event Streaming in Multimodal Large Language Models
- Unified Number-Free Text-to-Motion Generation Via Flow Matching
- PQDT: Pseudo-Query Dual Transformer for Robust Point Cloud Restoration
- Zoo3D: Zero-Shot 3D Object Detection at Scene Level
- Diagram2Structure: Unlocking LLMs' Diagram Comprehension through DiagramDiff, an Offline Diagram Structuring Framework
- Generalizing Visual Geometry Priors to Sparse Gaussian Occupancy Prediction
- Dynamic Visual SLAM using a General 3D Prior
- DuoGen: Towards Autonomous Interleaved Multimodal Generation
- Exemplar-Free Continual Learning for State Space Models
- B^3-Seg: Camera-Free, Training-Free 3DGS Segmentation via Analytic EIG and Beta-Bernoulli Bayesian Updates
- Pixel Motion Diffusion is What We Need for Robot Control
- Breaking the Scalability Limit of Multi-Projector Calibration with Embedded Cameras
- Unsupervised Multi-agent and Single-agent Perception from Cooperative Views
- Reconstruction-Guided Slot Curriculum: Addressing Object Over-Fragmentation in Video Object-Centric Learning
- GIFT: Global Irreplaceability Frame Targeting for Efficient Video Understanding
- FEAST: Fully Connected Expressive Attention for Spatial Transcriptomics
- Frame2Freq: Spectral Adapters for Fine-Grained Video Understanding
- Hierarchical Process Reward Models are Symbolic Vision Learners
- NAF: Zero-Shot Feature Upsampling via Neighborhood Attention Filtering
- PIX-TAB: Efficient PIXel-Precise TABle Structure Recognition Approach with Speculative Decoding and Region-Based Image Segmentation
- Modeling Cross-vision Synergy for Unified Large Vision Model
- High-Fidelity Diffusion Face Swapping with ID-Constrained Facial Conditioning
- Fully Decentralized Certified Unlearning
- Gallant: Voxel Grid-based Humanoid Locomotion and Local-navigation across 3-D Constrained Terrains
- ReLaGS: Relational Language Gaussian Splatting
- FSFSplatter: Geometrically Accurate Reconstruction with Free Sparse-view Images within 2 minutes
- Looking Beyond the Window: Global-Local Aligned CLIP for Training-free Open-Vocabulary Semantic Segmentation
- AVA-Bench: Atomic Visual Ability Benchmark for Vision Foundation Models
- M3Grounder: Mask-Based Multi-Span and Multi-Granular Grounding for Document QA
- MDS-VQA: Model-Informed Data Selection for Video Quality Assessment
- GenHOI: Towards Object-Consistent Hand–Object Interaction with Temporally Balanced and Spatially Selective Object Injection
- Harmonized Feature Conditioning and Frequency-Prompt Personalization for Multi-Rater Medical Segmentation
- Efficient Encoder-Free Fourier-based 3D Large Multimodal Model
- Prototype-as-Prompt: Multimodal Sentiment Prototypes Endowing Large Language Models the Capability to Perform Multimodal Sentiment Analysis
- TimeRipples: Accelerating vDiTs by Understanding the Spatio-Temporal Correlations in Latent Space
- UniDAC: Universal Metric Depth Estimation for Any Camera
- FFP-300K: Scaling First-Frame Propagation for Generalizable Video Editing
- Probabilistic Precipitation Nowcasting with Rectified Flow Transformers
- Text-guided Feature Disentanglement for Cross-modal Gait Recognition
- Realiz3D: 3D Generation Made Photorealistic via Domain-Aware Learning
- One-to-More: High-Fidelity Training-Free Anomaly Generation with Attention Control
- MUFASA: A Multi-Layer Framework for Slot Attention
- Seeing through Light and Darkness: Sensor-Physics Grounded Deblurring HDR NeRF from Single-Exposure Images and Events
- From Pairs to Sequences: Track-Aware Policy Gradients for Keypoint Detection
- Linking Perception, Confidence and Accuracy in MLLMs
- OrienPose: Orientation-Guided Novel View Synthesis for Single-Image Unseen Object Pose Estimation
- AdaSVD: Singular Value Decomposition with Adaptive Mechanisms for Large Multimodal Models
- WorldGen: From Text to Traversable and Interactive 3D Worlds
- EcoSplat: Efficiency-controllable Feed-forward 3D Gaussian Splatting from Multi-view Images
- Verifying Neural Network Robustness with Dual Perturbations
- Synthetic Curriculum Reinforces Compositional Text-to-Image Generation
- RelightAnyone: A Generalized Relightable 3D Gaussian Head Model
- ShelfOcc: Native 3D Supervision beyond LiDAR for Vision-Based Occupancy Estimation
- GeoSANE: Learning Geospatial Representations from Models, Not Data
- PhyOceanCast: Global Ocean Forecasting with Physics-Informed Diffusion
- SemLT3D: Semantic-Guided Expert Distillation for Camera-only Long-Tailed 3D Object Detection
- VGGT-Segmentor: Geometry-Enhanced Cross-View Segmentation
- Retrieving Counterfactuals Improves Visual In-Context Learning
- Sparse-View Localization via Online Neural 3D Regression
- Evidential Deep Partial Label Learning to Quantify Disambiguation Uncertainty
- RAISE: Requirement-Adaptive Evolutionary Refinement for Training-Free Text-to-Image Alignment
- HAMMER: Harnessing MLLMs via Cross-Modal Integration for Intention-Driven 3D Affordance Grounding
- ORIC: Benchmarking Object Recognition under Contextual Incongruity in Large Vision-Language Models
- ODGS-SLAM: Omnidirectional Gaussian Splatting SLAM
- Do VLMs Perceive or Recall? Probing Visual Perception vs. Memory with Classic Visual Illusions
- H2-Surv: Hierarchical Hyperbolic Multimodal Representation Learning for Survival Prediction
- ViRC: Enhancing Visual Interleaved Mathematical CoT with Reason Chunking
- RAW-Domain Degradation Models for Realistic Smartphone Super-Resolution
- UniLS: End-to-End Audio-Driven Avatars for Unified Listening and Speaking
- DTG-Restore: Training-Free Diffusion Refinement for Generative Video Super-Resolution
- Temporal Inversion for Learning Interval Change in Chest X-Rays
- PRISM: Video Dataset Condensation with Progressive Refinement and Insertion for Sparse Motion
- Solvability of the Viewing Graph Under the Affine Camera Model
- GazeShift: Unsupervised Gaze Estimation and Dataset for VR
- InterPrior: Scaling Generative Control for Physics-Based Human-Object Interactions
- Enhancing Spatial Understanding in Image Generation via Reward Modeling
- Thinking in 360°: Humanoid Visual Search in the Wild
- Detecting Unknown Objects via Energy-based Separation for Open World Object Detection
- OccuFly: A 3D Vision Benchmark for Semantic Scene Completion from the Aerial Perspective
- Phantom: Physical Object Interactions as Dynamic Triggers for NMS-Exploited Backdoors
- ReFTA: Breaking the Weight Reconstruction Bottleneck in Tensorized Parameter-Efficient Fine-Tuning
- CI-VID: A Coherent Interleaved Text-Video Dataset
- VidTAG: Temporally Aligned Video to GPS Geolocalization with Denoising Sequence Prediction at a Global Scale
- Streamlined Knowledge Distillation
- UniVBench: Towards Unified Evaluation for Video Foundation Models
- Cut to the Chase: Training-free Multimodal Summarization via Chain-of-Events
- SARL-STG: A Spatially Aware Reinforcement Learning Framework for Refining MLLMs in Spatio-Temporal Video Grounding
- CaptionQA: Is Your Caption as Useful as the Image Itself?
- SABER: Spatially Consistent 3D Universal Adversarial Objects for BEV Detectors
- DreamStereo: Towards Real-Time Stereo Inpainting for HD Videos
- Stand-In: A Lightweight and Plug-and-Play Identity Control for Video Generation
- RoadSceneBench: A Lightweight Benchmark for Mid-Level Road Scene Understanding
- PowerCLIP: Powerset Alignment for Contrastive Pre-Training
- DC-Merge: Improving Model Merging with Directional Consistency
- Mistake Attribution: Fine-Grained Mistake Understanding in Egocentric Videos
- Ground Reaction Inertial Poser: Physics-based Human Motion Capture from Sparse IMUs and Insole Pressure Sensors
- TIGER: A Unified Framework for Time, Images and Geo-location Retrieval
- Clay-to-Stone: Phase-wise 3D Gaussian Splatting for Monocular Articulated Hand-Object Manipulation Modeling
- DynBridge: Bridging Imagination and Control through Interaction Dynamics for Robot Manipulation
- Generalizable Video Quality Assessment via Weak-to-Strong Learning
- PR-MaGIC: Prompt Refinement Via Mask Decoder Gradient Flow For In-Context Segmentation
- EchoPOSE: 6D Pose Estimation of Sparse Echocardiograms for Left-Ventricular 3D Shape Reconstruction
- Resolving the Stability-Plasticity Dilemma in Reinforcement Learning via Complementary Continual Critics
- A Frame is Worth One Token: Efficient Generative World Modeling with Delta Tokens
- AntiStyler: Defending Object Detection Models Against Adversarial Patch Attacks Using Style Removal
- MooCap: A Multi-View Benchmark for Cow-Object-Human Interaction and Behavior Dynamics
- Linear Fundamental Matrix Estimation from 7 or 5 Points
- CREward: A Type-Specific Creativity Reward Model
- WHU-MARS: A Multispectral Aerial-Ground Benchmark Towards Any-Scenario Person Re-Identification
- TextPecker: Rewarding Structural Anomaly Quantification for Enhancing Visual Text Rendering
- Select Less, Reason More: Prioritizing Evidence Purity for Video Reasoning
- Dynamic Token Reweighting for Robust Vision-Language Models
- BUSSARD: Normalizing Flows for Bijective Universal Scene-Specific Anomalous Relationship Detection
- SpeeDe3DGS: Speedy Deformable 3D Gaussian Splatting with Temporal Pruning and Motion Grouping
- Adaptive Bayesian Early-Exit Networks for Efficient Non-Transferable Learning
- EmbodMocap: In-the-Wild 4D Human-Scene Reconstruction for Embodied Agents
- InfiniBench: Infinite Benchmarking for Visual Spatial Reasoning with Customizable Scene Complexity
- Can You Learn to See Without Images? Procedural Warm-Up for Vision Transformers
- PECCVAI: Overcoming the Brittleness of AI Image Watermarking Under Visual Paraphrasing Attacks
- ParTY: Part-Guidance for Expressive Text-to-Motion Synthesis
- NEAF: Natural Image Editing with Attention Fusion for Generalizable Test-time Optimization in Text-Guided Image Editing
- Real2Edit2Real: Generating Robotic Demonstrations via a 3D Control Interface
- HyperGaussians: High-Dimensional Gaussian Splatting for High-Fidelity Animatable Face Avatars
- MARCO: Navigating the Unseen Space of Semantic Correspondence
- MorphAny3D: Unleashing the Power of Structured Latent in 3D Morphing
- NitroGen: An Open Foundation Model for Generalist Gaming Agents
- Beyond Objects: Contextual Synthetic Data Generation for Fine-Grained Classification
- R^2-Seg: Training-Free OOD Medical Tumor Segmentation via Anatomical Reasoning and Statistical Rejection
- MAGICIAN: Efficient Long-Term Planning with Imagined Gaussians for Active Mapping
- Spherical Voronoi: Directional Appearance as a Differentiable Partition of the Sphere
- RAMEN: Resolution-Adjustable Multimodal Encoder for Earth Observation
- Stronger Normalization-Free Transformers
- Monocular Open Vocabulary Occupancy Prediction for Indoor Scenes
- Beyond Layer-Wise Merging: Chain-of-Merging for Vision-Language Models
- FisherPoser: Human Motion Estimation from Sparse Observations with Hierarchical Region-Wise Fisher-Matrix Uncertainty Modeling
- ShapeR: Robust Conditional 3D Shape Generation from Casual Captures
- FastGaMer: Efficient GainMap Learning for Practical Inverse Tone Mapping
- Featurising Pixels from Dynamic 3D Scenes with Linear In-Context Learners
- Circuit Mechanisms for Spatial Relation Generation in Diffusion Transformers
- Nestwork: Conditional 3D Furnished House Layout Generation through Latent Heterogeneous Graph Diffusion
- BiFM: Bidirectional Flow Matching for Few-Step Image Editing and Generation
- Mixture of Prototypes for Test-time Adaptive Segmentation
- Seele: A Unified Acceleration Framework for Real-Time Gaussian Splatting on Mobile Devices
- MoVie: Broaden Your Views with Human Motion for Action Detection
- Learning Scene Coordinate Reconstruction from Unposed Images via Pose Graph Optimization
- When Anonymity Breaks: Identifying Models Behind Text-to-Image Leaderboards
- FunFact: Building Probabilistic Functional 3D Scene Graphs via Factor-Graph Reasoning
- OctoMed: Data Recipes for State-of-the-Art Multimodal Medical Reasoning
- VIMCAN: Visual-Inertial 3D Human Pose Estimation with Hybrid Mamba-Cross-Attention Network
- TSTM: Temporal Segmentation for Task-relevant Mask in Visual Reinforcement Learning Generalization
- Text-Driven 3D Hand Motion Generation from Sign Language Data
- See, Think, Act: Teaching Multimodal Agents to Effectively Interact with GUI by Identifying Toggles
- Gated Condition Injection without Multimodal Attention: Towards Controllable Linear-Attention Transformers
- Phrase-grounded APO for Improving Chest X-ray Report Generation
- FEAT: Fashion Editing and Try-On from Any Design
- LaRP: Efficient Multi-View Inpainting with Latent Reprojection Priors
- GaussianFluent: Gaussian Simulation for Dynamic Scenes with Mixed Materials
- Deep Feature Deformation Weights
- Multimodal Distribution Matching for Vision-Language Dataset Distillation
- Understanding the Role of Hallucination in Reinforcement Post-Training of Multimodal Reasoning Models
- Scaling Spatial and Temporal Context for Robotic Imitation Learning Policies With Scene Graphs
- VideoARM: Agentic Reasoning over Hierarchical Memory for Long-Form Video Understanding
- HorizonForge: Driving Scene Editing with Any Trajectories and Any Vehicles
- Unifying Precise Keyframes and Semantic Control via Multi-level Diffusion
- Boosting Vision-Language-Action Finetuning with Feasible Action Neighborhood Prior
- DPAR: Dynamic Patchification for Efficient Autoregressive Visual Generation
- Enhancing Part-Level Point Grounding for Any Open-Source MLLMs
- PC-Talk: Precise Facial Animation Control for Audio-Driven Talking Face Generation
- REL-SF4PASS: Panoramic Semantic Segmentation with REL Depth Representation and Spherical Fusion
- DiffBMP: Differentiable Rendering with Bitmap Primitives
- GHPT: Real-Time Relightable Gaussian Splatting using Hybrid Path Tracing
- Coupling Liquid Time-Constant Encoders with Modern Hopfield Memory
- Reparameterized Tensor Ring Functional Decomposition for Multi-Dimensional Data Recovery
- PAI-Bench: A Comprehensive Benchmark For Physical AI
- When Robots Obey the Patch: Universal Transferable Patch Attacks on Vision-Language-Action Models
- Muses: Designing, Composing, Generating Nonexistent Fantasy 3D Creatures without Training
- Generative Diffusion Priors for 3D Mapping of the Dark Universe
- Splat-Based Metal Artifact Reduction in Cone-Beam CT via Compact Attenuation Modeling
- Seeing What Matters: Visual Preference Policy Optimization for Visual Generation
- Gen3R: 3D Scene Generation Meets Feed-Forward Reconstruction
- Unlocking Strong Supervision: A Data-Centric Study of General-Purpose Audio Pre-Training Methods
- REVIVE 3D: Refinement via Encoded Voluminous Inflated prior for Volume Enhancement
- Shape-of-You: Fused Gromov-Wasserstein Optimal Transport for Semantic Correspondence in-the-Wild
- YieldSAT: A Multimodal Benchmark Dataset for High-Resolution Crop Yield Prediction
- Differentiable Stroke Planning with Dual Parameterization for Efficient and High-Fidelity Painting Creation
- GraspLDP: Towards Generalizable Grasping Policy via Latent Diffusion
- RNN as Linear Transformer: A Closer Investigation into Representational Potentials of Visual Mamba Models
- Hypergraph-State Collaborative Reasoning for Multi-Object Tracking
- AirSim360: A Panoramic Simulation Platform within Drone View
- EagleNet: Energy-Aware Fine-Grained Relationship Learning Network for Text-Video Retrieval
- On the Role of Temporal Granularity in the Robustness of Spiking Neural Networks
- PhysGen: Physically Grounded 3D Shape Generation for Industrial Design
- Zero-Shot Image Denoising via Hybrid Prior-Guided Pseudo Sample Generation
- EVATok: Adaptive Length Video Tokenization for Efficient Visual Autoregressive Generation
- OccAny: Generalized Unconstrained Urban 3D Occupancy
- Beyond the Static World: Continual Category Discovery under Visual Drift
- Brewing Stronger Features: Dual-Teacher Distillation for Multispectral Earth Observation
- Is your VLM Sky-Ready? A Comprehensive Spatial Intelligence Benchmark for UAV Navigation
- PALM: Progress-Aware Policy Learning via Affordance Reasoning for Long-Horizon Robotic Manipulation
- Enhancing Out-of-Distribution Detection with Extended Logit Normalization
- OmniVTG: A Large-Scale Dataset and Training Paradigm for Open-World Video Temporal Grounding
- 240FPS Stereo Vision from Monocular Mixed Spikes
- SCE-Depth: A Spherical Compound Eye Framework for Wide FOV Depth Estimation
- Every Error has Its Magnitude: Asymmetric Mistake Severity Training for Multiclass Multiple Instance Learning
- AdaSpot: Spend Resolution Where It Matters for Precise Event Spotting
- Unique Lives, Shared World: Learning from Single-Life Videos
- Task-Driven Implicit Representations for Automated Design of LiDAR Systems
- Global Structure-from-Motion Meets Feedforward Reconstruction
- GS-ASM: 2DGS-Supervised Active Stereo Matching
- GardenDesigner: Encoding Aesthetic Principles into Jiangnan Garden Construction via a Chain of Agents
- Feed-forward Gaussian Registration for Head Avatar Creation and Editing
- EchoFoley: Event-Centric Hierarchical Control for Video Grounded Creative Sound Generation
- TEXTRIX: Latent Attribute Grid for Native Texture Generation and Beyond
- Uni-DAD: Unified Distillation and Adaptation of Diffusion Models for Few-step Few-shot Image Generation
- Resolving Endpoint Underfitting in Diffusion Bridges via Noise Alignment
- Captain Safari: A World Engine with Pose-Aligned 3D Memory
- Can a Second-View Image Be a Language? Geometric and Semantic Cross-Modal Reasoning for X-ray Prohibited Item Detection
- Vibe Spaces for Creatively Connecting and Expressing Visual Concepts
- MMLandmarks: a Cross-View Instance-Level Benchmark for Geo-Spatial Understanding
- Deformable Gaussian Occupancy: Decoupling Rigid and Nonrigid Motion with Factorized Distillation
- CHEEM: Continual Learning by Reuse, New, Adapt and Skip - A Hierarchical Exploration-Exploitation Approach
- LumiX: Structured and Coherent Text-to-Intrinsic Generation
- FARMER: Flow AutoRegressive Transformer over Pixels
- No Calibration, No Depth, No Problem: Cross-Sensor View Synthesis with 3D Consistency
- 4DSurf: High-Fidelity Dynamic Scene Surface Reconstruction
- Interact2Ar: Full-Body Human-Human Interaction Generation via Autoregressive Diffusion Models
- BabyVLM-V2: Toward Developmentally Grounded Pretraining and Benchmarking of Vision Foundation Models
- OraPO: Oracle-educated Reinforcement Learning for Data-efficient and Factual Radiology Report Generation
- InverFill: One-Step Inversion for Enhanced Few-Step Diffusion Inpainting
- BrepGaussian: CAD reconstruction from Multi-View Images with Gaussian Splatting
- Chart-FR1: Visual Focus-Driven Fine-Grained Reasoning on Dense Charts
- SeeThrough3D: Occlusion Aware 3D Control in Text-to-Image Generation
- MagicQuill V2: Precise and Interactive Image Editing with Layered Visual Cues
- Structural Graph Probing of Vision–Language Models
- VABench: A Comprehensive Benchmark for Audio-Video Generation
- CARLoS: Retrieval via Concise Assessment Representation of LoRAs at Scale
- Rethinking Token Reduction for Large Vision-Language Models
- Opti-NeuS: Neural Reconstruction for Dual-Layered Transparent and Opaque Objects
- X-PCR: A Benchmark for Cross-modality Progressive Clinical Reasoning in Ophthalmic Diagnosis
- CASPA: Graph-Structured Concept Anchors for Modality-Agnostic Adaptation in Vision–Language Models
- Virtual Immunohistochemistry Staining with Dual-Aligned Multi-Task Feature Guidance
- ChArtist: Generating Pictorial Charts with Unified Spatial and Subject Control
- SegMo: Co-Designing Content-Aware Sparsity and Locally-Cohesive Segment Parallelism for Efficient VLM Inference
- Structure-Aware Representation Distillation for Tiny-Dense Object Segmentation
- HAD: Hallucination-Aware Diffusion Priors for 3D Reconstruction
- Adapting In-context Generation for Enhanced Composed Image Retrieval
- A Bit is All You Need! Efficient Video Capture via Single Bit Imaging
- Selective Amnesia using Contrastive Subnet Erasure for Class Level Unlearning in Vision Models
- VCP-Attack: Visual-Contrastive Projection for Transferable Black-Box Targeted Attacks on Large Vision-Language Models
- VideoAuto-R1: Video Auto Reasoning via Thinking Once, Answering Twice
- Free-Grained Hierarchical Visual Recognition
- ORD: Object-Relation Decoupling for Generalized 3D Visual Grounding
- Spot The Ball: A Benchmark for Visual Social Inference
- SPOT: Spatiotemporal Prompt Optimization for Motion-Stabilized MLLM-Guided Video Segmentation
- BALM: A Model-Agnostic Framework for Balanced Multimodal Learning under Imbalanced Missing Rates
- Learnable Motion-Focused Tokenization for Effective and Efficient Video Unsupervised Domain Adaptation
- Multi-modal Test-time Adaptation via Adaptive Probabilistic Gaussian Calibration
- IsoCLIP: Decomposing CLIP Projectors for Efficient Intra-modal Alignment
- PanDA: Unsupervised Domain Adaptation for Multimodal 3D Panoptic Segmentation in Autonomous Driving
- Robustness Under Data Scarcity: Few-Shot Continual Adversarial Training for Evolving Threats
- Hist2Style: Histogram-Guided Stylization with Bilateral Grids
- MultiModalPFN: Extending Prior-Data Fitted Networks for Multimodal Tabular Learning
- Consensus vs. Controversy: Mapping the Decision Space Where Architectures Diverge
- A Semantically Disentangled Unified Model for Multi-category 3D Anomaly Detection
- BioVITA: Biological Dataset, Model, and Benchmark for Visual-Textual-Acoustic Alignment
- Focus on Background: Exploring SAM's Potential in Few-shot Medical Image Segmentation with Background-centric Prompting
- TC-Padé: Trajectory-Consistent Padé Approximation for Diffusion Acceleration
- FireScope: Wildfire Risk Raster Prediction With a Chain-of-Thought Oracle
- BEV-SLD: Self-Supervised Scene Landmark Detection for Global Localization with LiDAR Bird’s-Eye View Images
- Attention, May I Have Your Decision? Localizing Generative Choices in Diffusion Models
- BoostSLT: Boosting Sign Language Translation via a Plug-and-Play Diffusion-Based Semantic Enhancer
- Forging a Dynamic Memory: Retrieval-Guided Continual Learning for Generalist Medical Foundation Models
- TRivia: Self-supervised Fine-tuning of Vision-Language Models for Table Recognition
- Agentic Retoucher for Text-To-Image Generation
- VRR-QA: Visual Relational Reasoning in Videos Beyond Explicit Cues
- ProSoftArena: Benchmarking Hierarchical Capabilities of Multi-modal Agents in Professional Software Environments
- Progressive Multi-cue Alignment for Unaligned RGBT Tracking
- SVBench: Evaluation of Video Generation Models on Social Reasoning
- Guiding Diffusion Models with Fine-Grained Conditions and Semantics-Preserving Sampling for One-Shot Federated Learning
- Align Images Before You Generate
- Learning Diffeomorphism for Medical Image Registration with Time-Embedded Architectures Using Semigroup Regularization
- SHAPE: Structure-aware Hierarchical Unsupervised Domain Adaptation with Plausibility Evaluation for Medical Image Segmentation
- High-Quality and Efficient Turbulence Mitigation with Events
- Roots Beneath the Cut: Uncovering the Risk of Concept Recovery in Pruning-Based Unlearning for Diffusion Models
- Compressed-Domain-Aware Online Video Super-Resolution
- FreeForm: Reduced-Order Deformable Simulation from Particle-Based Skinning Eigenmodes
- Understanding Task Transfer in Vision-Language Models
- Rethinking Asymmetric Quantization: Hidden Symmetry in Vision Model Weights
- Explaining CLIP Zero-shot Predictions Through Concepts
- FaceCam: Portrait Video Camera Control via Scale-Aware Conditioning
- Region-Aware Instance Consistency Learning for Micro-Expression Recognition
- Quota-Calibrated Fine-Grained Alignment with Context-Aware Marginals for Text-based Person Retrieval
- TAR: Token-Aware Refinement for Fine-grained Generalized Category Discovery
- Mixture-of-Experts based Feature Decoupling for Open Vocabulary Scene Graph Generation
- Depth Any Endoscopy: Towards Self-Supervised Generalizable Depth Estimation in Monocular Endoscopy
- Open-Ended Instruction Realization with LLM-Enabled Multi-Planner Scheduling in Autonomous Vehicles
- Follow the Saliency: Supervised Saliency for Retrieval-augmented Dense Video Captioning
- QuadSync: Quadrifocal Tensor Synchronization via Tucker Decomposition
- VidEoMT: Your ViT is Secretly Also a Video Segmentation Model
- NavForesee: A Unified Vision-Language World Model for Hierarchical Planning and Dual-Horizon Navigation Prediction
- Driving on Registers
- Diffusion-Based sRGB Real Noise Generation via Prompt-Driven Noise Representation Learning
- AnomalyVFM -- Transforming Vision Foundation Models into Zero-Shot Anomaly Detectors
- Personalized Federated Training of Diffusion Models with Privacy Guarantees
- Better, Stronger, Faster: Tackling the Trilemma in MLLM-based Segmentation with Simultaneous Textual Mask Prediction
- Generalizable Co-Salient Object Detection via Mixed Content-Style Modulation
- TM-BSN: Triangular-Masked Blind-Spot Network for Real-World Self-Supervised Image Denoising
- Focal–General Diffusion Model with Semantic Consistent Guidance for Sign Language Production
- Mirror Illusion Art
- KVSmooth: Mitigating Hallucination in Multi-modal Large Language Models through Key-Value Smoothing
- WaTeRFlow: Watermark Temporal Robustness via Flow Consistency
- SHARP: Short-Window Streaming for Accurate and Robust Prediction in Motion Forecasting
- Efficient Equivariant Transformer for Self-Driving Agent Modeling
- Universal Guideline-Driven Image Clustering via a Hybrid LLM Agent
- HyCal: A Training-Free Prototype Calibration Method for Cross-Discipline Few-Shot Class-Incremental Learning
- Factorized Context Aggregation for Robust Cancer Risk Estimation via Soft Re-Ranked Retrieval and Hierarchical Anchors
- IntrinsicWeather: Controllable Weather Editing in Intrinsic Space
- FantasyVLN: Unified Multimodal Chain-of-Thought Reasoning for Vision-and-Language Navigation
- WildPose: A Unified Framework for Robust Pose Estimation in the Wild
- Cross-Axis Feature Fusion with Joint-Wise Motion Difference Prediction for Text-Based 3D Human Motion Editing
- Ov3R: Open-Vocabulary Semantic 3D Reconstruction from RGB Videos
- DSERT-RoLL: Robust Multi-Modal Perception for Diverse Driving Conditions with Stereo Event-RGB-Thermal Cameras, 4D Radar, and Dual-LiDAR
- VIRO: Robust and Efficient Neuro-Symbolic Reasoning with Verification for Referring Expression Comprehension
- PixelRush: Ultra-Fast, Training-Free High-Resolution Image Generation via One-step Diffusion
- Describe Anything Anywhere At Any Moment
- Breaking the Regional Perception Bottleneck of Multimodal Large Language Models via External Reasoning Framework
- Finding Distributed Object-Centric Properties in Self-Supervised Transformers
- Relightable Holoported Characters: Capturing and Relighting Dynamic Human Performance from Sparse Views
- DynFusion: Rethinking Condition Fusion for Adaptive Multi-Conditional Text-to-Image Generation
- Accelerating Diffusion-based Video Editing via Heterogeneous Caching: Beyond Full Computing at Sampled Denoising Timestep
- Physics-Consistent Diffusion for Efficient Fluid Super-Resolution via Multiscale Residual Correction
- Designing Instance-Level Sampling Schedules via REINFORCE with James-Stein Shrinkage
- No Hard Negatives Required: Concept Centric Learning Leads to Compositionality without Degrading Zero-shot Capabilities of Contrastive Models
- Multi-Patch Global-to-Local Transformer Architecture For Efficient Flow Matching and Diffusion Model
- SineProject: Machine Unlearning for Stable Vision-Language Alignment
- Depth Hypothesis Guided Iterative Refinement for Event–Image Monocular Depth Estimation
- Order Matters: 3D Shape Generation from Sequential VR Sketches
- FAAR: Efficient Frequency-Aware Multi-Task Fine-Tuning via Automatic Rank Selection
- From Corners to Fiducial Tags: Revisiting Checkerboard Calibration for Event Cameras
- VectorArk: Learning Practical Image Vectorization with Rounded Polygon Representation
- RoboTAG: End-to-end Robot Pose Estimation via Topological Alignment Graph
- MAPS: Preserving Vision-Language Representations via Module-Wise Proximity Scheduling for Better Vision-Language-Action Generalization
- 2D-LFM: Lifting Foundation Model without 3D Supervision
- INSIGHT Bench: Towards Grounded IN-SItu Guidance for Robotic ManipulaTion
- TESSERA: Temporal Embeddings of Surface Spectra for Earth Representation and Analysis
- Foundation Model Priors Enhance Object Focus in Feature Space for Source-Free Object Detection
- ARMFlow: AutoRegressive MeanFlow for Online 3D Human Reaction Generation
- Parallel Rigidity Matters for Bundle Adjustment
- Is the Modality Gap a Bug or a Feature? A Robustness Perspective
- KLIP: Localized Distribution Shift Detection via KL-Divergence with Diffusion Priors in Inverse Problems
- Towards Uncertainty-aware Unsupervised Domain Adaptation for Videos and Time-Series with Causal Optimal Transport
- Dynamic Magic: Unleashing Restricted Knowledge for Lifelong Person Re-Identification
- Multi-SpatialMLLM: Multi-Frame Spatial Understanding with Multi-Modal Large Language Models
- Red-teaming Retrieval-Augmented Diffusion Models via Poisoning Knowledge Bases
- Reliable Policy Transfer for Safety-Aware End-to-End Driving with Deep Reinforcement Learning
- cryoSENSE: Compressive Sensing Enables High-throughput Microscopy with Sparse and Generative Priors on the Protein Cryo-EM Image Manifold
- Minimal Constraint Relaxation for Multiview Autocalibration
- EfficientVPR: Toward Efficient Visual Place Recognition via Scene-Aware Prompt Tuning and Adaptive Feature Enhancement
- OrthoFuse: Training-free Riemannian Fusion of Orthogonal Style-Concept Adapters for Diffusion Models
- CADC: Content Adaptive Diffusion-Based Generative Image Compression
- Unlocking the Power of Critical Factors for 3D Visual Geometry Estimation
- NeuroSeg Meets DINOv3: Transferring 2D Self-Supervised Visual Priors to 3D Neuron Segmentation via DINOv3 Initialization
- ReasonX: MLLM-Guided Intrinsic Image Decomposition
- How to Take a Memorable Picture? Empowering Users with Actionable Feedback
- TESO: Online Tracking of Essential Matrix by Stochastic Optimization
- RemedyGS: Defend 3D Gaussian Splatting Against Computation Cost Attacks
- DSO: Direct Steering Optimization for Bias Mitigation
- UniPart: Part-Level 3D Generation with Unified 3D Geom–Seg Latents
- Reward Sharpness-Aware Fine-Tuning for Diffusion Models
- SonoWorld: From One Image to a 3D Audio-Visual Scene
- MimicTalker: A Multimodal Interactive and Memory-Enhanced Framework for Real-Time Dyadic 3D Head Generation
- DualPrim: Compact 3D Reconstruction with Positive and Negative Primitives
- Latent Implicit Visual Reasoning
- MoECLIP: Patch-Specialized Experts for Zero-shot Anomaly Detection
- Denoise and Align: Towards Source-Free UDA for Robust Panoramic Semantic Segmentation
- Focus, Don’t Prune: Identifying Instruction-Relevant Regions for Information-Rich Image Understanding
- SAVE: Speech-Aware Video Representation Learning for Video-Text Retrieval
- Scene Grounding in the Wild
- InstructMix2Mix: Consistent Sparse-View Editing Through Multi-View Model Personalization
- TempoMaster: Efficient Long Video Generation via Next-Frame-Rate Prediction
- High Resolution Neural Video Coding with Bi-directional Confidence-Guided Reference Information Modeling
- Iris: Integrating Language into Diffusion-based Monocular Depth Estimation
- AdaSFormer: Adaptive Serialized Transformers for Monocular Semantic Scene Completion from Indoor Environments
- G-MIXER: Geodesic Mixup-based Implicit Semantic Expansion and Explicit Semantic Re-ranking for Zero-Shot Composed Image Retrieval
- MarkushGrapher-2: End-to-end Multimodal Recognition of Chemical Structures
- ORION: ORthonormal Text Encoding for Universal VLM AdaptatION
- Multinex: Lightweight Low-light Image Enhancement via Multi-prior Retinex
- RHO: Robust Holistic OSM-Based Metric Cross-View Geo-Localization
- GM-R^2: Generative Matching Learning for Unsupervised Geometric Representation and Registration
- FabricGen: Microstructure-Aware Woven Fabric Generation
- Mirai: Autoregressive Visual Generation Needs Foresight
- OlmoEarth: Stable Latent Image Modeling for Multimodal Earth Observation
- Time Blindness: Why Video-Language Models Can’t See What Humans Can?
- SAT-RRG: LLM-Guided Self-Adaptive Training for Radiology Report Generation with Token-Level Push–Pull Optimization
- Evo-Retriever: LLM-Guided Curriculum Evolution with Viewpoint-Pathway Collaboration for Multimodal Document Retrieval
- Is Bin Generation Indispensable? A Bin-Generation-Free Dataset Quantization via Semantic Perspective
- MERG3R: A Divide-and-Conquer Approach to Large-Scale Neural Visual Geometry
- Guiding Token-Sparse Diffusion Models
- LiREC-Net: A Target-Free and Learning-Based Network for LiDAR, RGB, and Event Calibration
- GeneVAR: Causal MeanFlow for Autoregressive Gene-to-WSI Tile Synthesis
- CURVE: A Benchmark for Cultural and Multilingual Long Video Reasoning
- LayoutAD: Exploring Semantic-Geometric Misalignment Reasoning for Scene Layout Anomaly Detection
- MindPower: Enabling Theory-of-Mind Reasoning in VLM-based Embodied Agents
- VideoWeaver: Multimodal Multi-View Video-to-Video Transfer for Embodied Agents
- GGPT: Geometry-Grounded Point Transformer
- LongVT: Incentivizing "Thinking with Long Videos" via Native Tool Calling
- See Through the Noise: Improving Domain Generalization in Gaze Estimation
- EasyV2V: A High-quality Instruction-based Video Editing Framework
- ViterbiPlanNet: Injecting Procedural Knowledge via Differentiable Viterbi for Planning in Instructional Videos
- Predict Before You Explore: Predictive Planning with Specialized Memory for Embodied Question Answering
- Life-IQA: Boosting Blind Image Quality Assessment through GCN-enhanced Layer Interaction and MoE-based Feature Decoupling
- Latent Diffusion Inversion Requires Understanding the Latent Space
- Video Panels for Long Video Understanding
- 4D Local Modeling Toward Dynamic Global Perception for Ambiguity-free Rotation-Invariant Point Cloud Analysis
- SparseCam4D: Spatio-Temporally Consistent 4D Reconstruction from Sparse Cameras
- Uni3R: Unified 3D Reconstruction and Semantic Understanding via Generalizable Gaussian Splatting from Unposed Multi-View Images
- Discover, Segment, and Select: A Progressive Mechanism for Zero-shot Camouflaged Object Segmentation
- Online Data Curation for Object Detection via Marginal Contributions to Dataset-level Average Precision
- Easy2Hard: From Partially to Fully Unmatched Modalities as Negative Samples in Contrastive Learning
- Global-Aware Edge Prioritization for Pose Graph Initialization
- Efficient and Training-Free Single-Image Diffusion Models
- 4D-RGPT: Toward Region-level 4D Understanding via Perceptual Distillation
- CRFT: Consistent–Recurrent Feature Flow Transformer for Cross-Modal Image Registration
- Vista4D: Video Reshooting with 4D Point Clouds
- InnoAds-Composer: Efficient Condition Composition for E-Commerce Poster Generation
- HybridDriveVLA: Vision-Language-Action Model with Visual CoT reasoning and ToT Evaluation for Autonomous Driving
- IVAAN: Instance-level Vision-Language Alignment via Attribute-Guided Text Prompts Generation for Nuclei Analysis
- FedRAC: Rolling Submodel Allocation for Collaborative Fairness in Federated Learning
- SEA: Evaluating Sketch Abstraction Efficiency via Element-level Commonsense Visual Question Answering
- Towards Decompositional Human Motion Generation with Energy-Based Diffusion Models
- Can Natural Image Autoencoders Compactly Tokenize fMRI Volumes for Long-Range Dynamics Modeling?
- Sketch2Colab: Sketch-Conditioned Multi-Human Animation via Controllable Flow Distillation
- Merge3D: Efficient 3D Multimodal LLMs via Joint 2D-3D Token Merging
- MSRL: Scaling Generative Multimodal Reward Modeling via Multi-Stage Reinforcement Learning
- Anomaly-Related Residual Fields for Cross-domain Anomaly Detection
- Beyond the Ground Truth: Enhanced Supervision for Image Restoration
- NVGS: Neural Visibility for Occlusion Culling in 3D Gaussian Splatting
- LUMINA: A Multi-Vendor Mammography Benchmark with Energy Harmonization Protocol
- The Universal Normal Embedding
- Sampling-Aware Quantization for Diffusion Models
- PAMotion: Physics-Aware Motion Generation for Full-Body Interaction with Multiple Objects
- EXOTIC: External Vision-driven Incomplete Multi-view Classification
- GeoRelight: Learning Joint Geometrical Reconstruction and Relighting with Flexible Multi-Modal Diffusion Transformers
- Molmo2: Open Weights and Data for Vision-Language Models with Video Understanding and Grounding
- SAG-GNN: Semantic-Aware Guided GNN for Descriptor-Free 2D-3D Matching
- VIRD: View-Invariant Representation through Dual-Axis Transformation for Cross-View Pose Estimation
- MVLM: Template-Free Tracking via Vision–Language Margin Confidence and Memory-Gated Tracking
- TLMA: Mitigating the Impact of Weakly Labeled Information for Video Anomaly Detection
- CausalVAD: De-confounding End-to-End Autonomous Driving via Causal Intervention
- Photo3D: Advancing Photorealistic 3D Generation through Structure-Aligned Detail Enhancement
- Which Concepts to Forget and How to Refuse? Decomposing Concepts for Continual Unlearning in Large Vision-Language Models
- Align While Search: Belief-Guided Exploratory Inference for World-Grounded Embodied Agents
- Self-guided Semantic Inspection for Zero-Shot Composed Image Retrieval
- Reading or Reasoning? Format Decoupled Reinforcement Learning for Document OCR
- More than the Sum: Panorama-Language Models for Adverse Omni-Scenes
- One Algorithm to Align Them All
- LLaVAShield: Safeguarding Multimodal Multi-Turn Dialogues in Vision-Language Models
- GenieDrive: Towards Physics-Aware Driving World Model with 4D Occupancy Guided Video Generation
- MM-ACT: Learn from Multimodal Parallel Generation to Act
- Diagnosing and Repairing Unsafe Channels in Vision-Language Models via Causal Discovery and Dual-Modal Safety Subspace Projection
- Sensor2Sensor: Cross-Embodiment Sensor Conversion for Autonomous Driving
- LaDy: Lagrangian-Dynamic Informed Network for Skeleton-based Action Segmentation via Spatial-Temporal Modulation
- SWIFT: Sliding Window Reconstruction for Few-Shot Training-Free Generated Video Attribution
- MUSE: Harnessing Precise and Diverse Semantics for Few-Shot Whole Slide Image Classification
- LS-ViT: Least-Squares Hessian Based Block Reconstruction for Low-Bit Post-Training Quantization of Vision Transformers
- Camera Control for Text-to-Image Generation via Learning Viewpoint Tokens
- MaskDexGrasp: Generative Masked Modeling for Part-Aware Dexterous Grasp Synthesis
- Revisiting the Necessity of Full Accuracy: Weakly Supervised Object-Level Offset Correction for Misaligned Building Labels
- Learning to Drive is a Free Gift: Large-Scale Label-Free Autonomy Pretraining from Unposed In-The-Wild Videos
- FastEventDGS: Deformable Gaussian Splatting for Fast Dynamic Scenes from a Single Event Camera
- QUANTIPHY: A Quantitative Benchmark Evaluating Physical Reasoning Abilities of Vision-Language Models
- UniT: Unified Multimodal Chain-of-Thought Test-time Scaling
- From Rays to Projections: Better Inputs for Feed-Forward View Synthesis
- OLATverse: A Large-scale Real-world Object Dataset with Precise Lighting Control
- When Do Models Actually Decide? Mapping the Layer-Wise Decision Timeline in Pretrained Neural Networks
- mmWaveFlow: Unified Enhancement and Generation of mmWave Human Point Clouds
- BiGain: Unified Token Compression for Joint Generation and Classification
- PhysHO: Physics-Based Dynamic 3D Gaussian Human and Object from Monocular Video
- Physically-Grounded Turbulence Mitigation with Frame-Shared Degradation Parameters
- Temporal Interaction in Spiking Transformers with Multi-Delay Mixer
- PHAC: Promptable Human Amodal Completion
- SynMotion: Semantic-Visual Adaptation for Motion Customized Video Generation
- Question-guided Visual Compression with Memory Feedback for Long-Term Video Understanding
- Catalyst4D: High-Fidelity 3D-to-4D Scene Editing via Dynamic Propagation
- ExtrinSplat: Decoupling Geometry and Semantics for Open-Vocabulary Understanding in 3D Gaussian Splatting
- Dexterous World Models
- WRIVINDER: Towards Spatial Intelligence for Geo-locating Ground Images onto Satellite Imagery
- VGA-Bench: A Unified Benchmark and Multi-Model Framework for Video Aesthetics and Generation Quality Evaluation
- StreamDiT: Real-Time Streaming Text-to-Video Generation
- Mechanisms of Object Localization in Vision–Language Models
- VISion On Request: Enhanced VLLM efficiency with sparse, dynamically selected, vision-language interactions
- LocateAnything3D: Vision-Language 3D Detection with Chain-of-Sight
- UniLight: A Unified Representation for Lighting
- D-Convexity: A Unified Differentiable Convex Shape Prior via Quasi-Concavity for Data-driven Image Segmentation
- CaST-Bench: Benchmarking Causal Chain-Grounded Spatio-Temporal Reasoning for Video Question Answering
- See and Fix the Flaws: Enabling VLMs and Diffusion Models to Comprehend Visual Artifacts via Agentic Data Synthesis
- Diffusion Mental Averages
- Simple but Effective Triplet-Based Compression Strategies for Compact Visual Localization
- Beyond Ground-Truth: Leveraging Image Quality Priors for Real-World Image Restoration
- MUST: Modality-Specific Representation-Aware Transformer for Diffusion-Enhanced Survival Prediction with Missing Modality
- Mining Attribute Subspaces for Efficient Fine-tuning of 3D Foundation Models
- SwitchCraft: Training-Free Multi-Event Video Generation with Attention Controls
- Goldilocks Test Sets for Face Verification
- MODIX: A Training-Free Multimodal Information-Driven Positional Index Scaling for Vision-Language Models
- BAMI: Training-Free Bias Mitigation in GUI Grounding
- DriveCombo: Benchmarking Compositional Traffic Rule Reasoning in Autonomous Driving
- Bridging the Perception Gap in Image Super-Resolution Evaluation
- PhysSkin: Real-Time and Generalizable Physics-Based Animation via Self-Supervised Neural Skinning
- Generalizable Sparse-View 3D Reconstruction from Unconstrained Images
- Urban-GS: A Unified 3D Gaussian Splatting Framework for Compact and High-Fidelity Aerial-to-Street Reconstruction
- Mitigating Multimodal Hallucinations via Gradient-based Self-Reflection
- F^2HDR: Two-Stage HDR Video Reconstruction via Flow Adapter and Physical Motion Modeling
- ProgressiveAvatars: Progressive Animatable 3D Gaussian Avatars
- ActionMesh: Animated 3D Mesh Generation with Temporal 3D Diffusion
- Pointing at Parts: Training-Free Few-Shot Grounding in Multimodal LLMs
- GeoSemba: Reconstructing State Space Model for Cross Paradigm Representation in Medical Image Segmentation
- IPR-1: Interactive Physical Reasoner
- LA-Pose: Latent Action Pretraining Meets Pose Estimation
- Bidirectional Multimodal Prompt Learning with Scale-Aware Training for Few-Shot Multi-Class Anomaly Detection
- FG-Portrait: 3D Flow Guided Editable Portrait Animation
- C^2FG: Control Classifier-Free Guidance via Score Discrepancy Analysis
- Adapting Lightweight Image-based Counting Models for Video Crowd Counting
- TIPSv2: Advancing Vision-Language Pretraining with Enhanced Patch-Text Alignment
- OpenDance: Multimodal Controllable 3D Dance Generation with Large-scale Internet Data
- InterPhys: Physics-aware Human Motion Synthesis in a Dynamic Scene
- DLVP-CLIP: Enhancing Fine-Grained Zero-Shot Anomaly Detection via Dynamic Local Visual Prompting
- MM-SeR: Multimodal Self-Refinement for Lightweight Image Captioning
- SLARM: Streaming and Language-Aligned Reconstruction Model for Dynamic Scenes
- Semantic Foam: Unifying Spatial and Semantic Scene Decomposition
- Topology-aware Feature Propagation for Unsupervised Non-rigid Point Cloud Correspondence
- Decision Boundary-aware Generation for Long-tailed Learning
- UCMNet: Uncertainty-Aware Context Memory Network for Under-Display Camera Image Restoration
- Visual Autoregressive Modeling via Next Focus Prediction
- LifeEval: A Multimodal Benchmark for Assistive AI in Egocentric Daily Life Tasks
- SearchAD: Large-Scale Rare Image Retrieval Dataset for Autonomous Driving
- GeoDiff4D: Geometry-Aware Diffusion for 4D Head Avatar Reconstruction
- Decoupled Residual Denoising Diffusion Models for Unified and Data Efficient Image-to-Image Translation
- Dropping Anchor and Spherical Harmonics for Sparse-view Gaussian Splatting
- Detecting Compressed AI-Generated Images via Phase Spectrum Robustness
- Revisiting Monocular SLAM with Spatio-Temporal Scene Modeling
- When Token Pruning is Worse than Random: Understanding Visual Token Information in VLLMs
- Immunizing Models Against Harmful Long-Horizon Fine-Tuning via Contractive Optimization Dynamics
- OS-Fed: One Snapshot Is All You Need
- Changes in Real Time: Online Scene Change Detection with Multi-View Fusion
- DBMSolver: A Training-free Diffusion Bridge Sampler for High-Quality Image-to-Image Translation
- Cross-domain Dual-stream Feature Disentanglement for Brain Disorder Prediction with Sparsely Labeled PET
- Sparse Spectral LoRA: Routed Experts for Medical VLMs
- Scale Space Diffusion
- Dark3R: Learning Structure from Motion in the Dark
- PolySLGen: Online Multimodal Speaking-Listening Reaction Generation in Polyadic Interaction
- RGB-Event based Pedestrian Attribute Recognition: A Benchmark Dataset and An Asymmetric RWKV Fusion Framework
- PrITTI: Primitive-based Generation of Controllable and Editable 3D Semantic Urban Scenes
- SR3R: Rethinking Super-Resolution 3D Reconstruction With Feed-Forward Gaussian Splatting
- Rethinking Glyph Spatial Information in Font Generation
- Hierarchical Long Video Understanding with Audiovisual Entity Cohesion and Agentic Search
- ReLaX: Reasoning with Latent Exploration for Large Reasoning Models
- RAP: Fast Feedforward Rendering-Free Attribute-Guided Primitive Importance Score Prediction for Efficient 3D Gaussian Splatting Processing
- Attention Surgery: An Efficient Recipe to Linearize Your Video Diffusion Transformer
- Rethinking Camera Choice: An Empirical Study on Fisheye Camera Properties in Robotic Manipulation
- Structure-to-Intensity Diffusion for Adverse-Weather LiDAR Generation
- GenTract: Generative Global Tractography
- KV-Tracker: Real-Time Pose Tracking with Transformers
- Adapter Shield: A Unified Framework with Built-in Authentication for Preventing Unauthorized Zero-Shot Image-to-Image Generation
- ElasticFormer: Detecting Objects in HRW Shots via Elastic Computing Vision Transformer
- Plug-and-Play PDE Optimization for 3D Gaussian Splatting: Toward High-Quality Rendering and Reconstruction
- HOG-Layout: Hierarchical 3D Scene Generation, Optimization and Editing via Vision-Language Models
- SPEAR-1: Scaling Beyond Robot Demonstrations via 3D Understanding
- High-Fidelity Mobile Avatars with Pruned Local Blendshapes
- EVLF: Early Vision-Language Fusion for Generative Dataset Distillation
- SCIEval: Evaluating and Benchmarking the Faithfulness of Scientific Image Generation and Interpretation with Large Multimodal Models
- Match-and-Fuse: Consistent Generation from Unstructured Image Sets
- Distributed Image Compression with Multimodal Side Information at Extremely Low Bitrates
- Think-Then-Generate: Structural Chain-of-Thought Reasoning for Consistent 3D Generation
- ReaGEN: Adaptive Generation of Structured Chains-of-Thought for Efficient Multimodal Reasoning
- CamDirector: Towards Long-Term Coherent Video Trajectory Editing
- PointNSP: Autoregressive 3D Point Cloud Generation with Next-Scale Level-of-Detail Prediction
- Unleashing Stealthy Backdoor Pandemic by Infecting a Single Diffusion Model
- IAFMNet: Information-Aware Feature Modulation for Efficient Super-Resolution
- MPL: Match-guided Prototype Learning for Few-shot Action Recognition
- Content-Aware Dynamic Patchification for Efficient Video Diffusion
- UniGen-1.5: Enhancing Image Generation and Editing through Reward Unification in RL
- DiGraphHal-Bench: Evaluating Multimodal Large Language Models on Complex Directed Graphs
- Bridge: Basis-Driven Causal Inference Marries VFMs for Domain Generalization
- ExPose: Reinforcing Video Generation Models for Extreme Pose Estimation
- Inter-Photon-Limited Videography
- OpenFS: Multi-Hand-Capable Fingerspelling Recognition with Implicit Signing-Hand Detection and Frame-Wise Letter-Conditioned Synthesis
- A Closed-Form Solution for Debiasing Vision-Language Models with Utility Guarantees Across Modalities and Tasks
- Evidential Neural Radiance Fields
- Scenes as Tokens: Multi-Scale Normal Distributions Transform Tokenizer for General 3D Vision–Language Understanding
- Coupled Diffusion Sampling for Training-Free Multi-View Image Editing
- Bridging RGB and Hematoxylin Components: An Interleaved Guidance and Fusion Framework for Point Supervised Nuclei Segmentation
- IDperturb: Enhancing Variation in Synthetic Face Generation via Angular Perturbations
- VoxTell: Free-Text Promptable Universal 3D Medical Image Segmentation
- TriLite: Efficient Weakly Supervised Object Localization with Universal Visual Features and Tri-Region Disentanglement
- HOLO: Homography-Guided Pose Estimator Network for Fine-Grained Visual Localization on SD Maps
- Human Geometry Distribution for 3D Animation Generation
- Through the Frequency Lens: Cross-Domain Generalisable Gaze Estimation with Adaptive Modulation
- Dense Metric Depth Completion from Sparse Direct Time-of-Flight Sensors
- SpaceTools: Tool-Augmented Spatial Reasoning via Double Interactive RL
- Interactive Episodic Memory with User Feedback
- SHands: A Multi-View Dataset and Benchmark for Surgical Hand-Gesture and Error Recognition Toward Medical Training
- Hear What You See: Video-to-Audio Generation with Diffusion Transformer and Semantic-Temporal Alignment-Ranked Direct Preference Optimization
- Thinking with Drafts: Speculative Temporal Reasoning for Efficient Long Video Understanding
- Mixture of States: Routing Token-Level Dynamics for Multimodal Generation
- MedLIME: A Distribution-Aligned and Evidence-Supported Framework for Medical Saliency Explanations
- Otil: Accelerating Diffusion Model Inference via Communication-Efficient Multi-GPU Parallelism
- BiomedCCPL: Causal Conditional Prompt Learning for Biomedical Vision-Language Models
- LumiMotion: Improving Gaussian Relighting with Scene Dynamics
- ScenDi: 3D-to-2D Scene Diffusion Cascades for Urban Generation
- Learning Forgery-Aware Lip Representations Without Forgery Priors
- PnP-CM: Consistency Models as Plug-and-Play Priors for Inverse Problems
- Fresco: Frequency–Spatial Consistent Optimization for Fine-Grained Head Avatar Modeling
- Landscape-Awareness for Geometric View Diffusion Model
- Multi-Modal Representation Learning via Semi-Supervised Rate Reduction for Generalized Category Discovery
- Rascene: High-Fidelity 3D Scene Imaging with mmWave Communication Signals
- InsCal: Calibrated Multi-Source Fully Test-Time Prompt Tuning for Object Detection
- SASNet: Spatially-Adaptive Sinusoidal Networks for INRs
- Mark4D: Temporally-Consistent Watermarking for 4D Gaussian Splatting
- Hyperbolic Busemann Neural Networks
- GeCo-SRT: Geometry-aware Continual Adaptation for Cross-Task Sim-to-Real Transfer
- Space-Time Forecasting of Dynamic Scenes with Motion-aware Gaussian Grouping
- StableMTL: Repurposing Latent Diffusion Models for Multi-Task Learning from Partially Annotated Synthetic Datasets
- Infinity-RoPE: Action-Controllable Infinite Video Generation Emerges From Autoregressive Self-Rollout
- Hear What Matters! Text-conditioned Selective Video-to-Audio Generation
- DLWM: Dual Latent World Models enable Holistic Gaussian-centric Pre-training in Autonomous Driving
- Progress by Pieces: Test-Time Scaling for Autoregressive Image Generation
- Guiding Diffusion Models with Semantically Degraded Conditions
- IMAIA: Interactive Maps AI Assistant for Travel Planning and Geo-Spatial Intelligence
- REACH: Explicit Recovery Behavior for Diffusion Policies
- D2T2 - Multimodal Automated Planning for Brachytherapy
- Attribution-Guided Model Rectification of Unreliable Neural Network Behaviors
- SketchRevive: Fine-Grained Pixel-to-Vector Sketch Completion with Diffusion-Prior-Guided Multimodal LLMs
- 3D-VCD: Hallucination Mitigation in 3D-LLM Embodied Agents through Visual Contrastive Decoding
- Spatio-Temporal Difference Guided Motion Deblurring with the Complementary Vision Sensor
- CTCal: Rethinking Text-to-Image Diffusion Models via Cross-Timestep Self-Calibration
- Teaching DINOv3 About Partial 3D Geometry: A Self-Supervised Geometry-Aware Approach
- tttLRM: Test-Time Training for Long Context and Autoregressive 3D Reconstruction
- ELiC: Efficient LiDAR Geometry Compression via Cross-Bit-depth Feature Propagation and Bag-of-Encoders
- DiffGraph: An Automated Agent-driven Model Merging Framework for In-the-Wild Text-to-Image Generation
- Benchmarking Endoscopic Surgical Image Restoration and Beyond
- ELITE: Efficient Gaussian Head Avatar from a Monocular Video via Learned Initialization and Test-time Generative Adaptation
- EvObj: Learning Evolving Object-centric Representations for 3D Instance Segmentation without Scene Supervision
- TimeBridge: Self-Supervised Video Representation Learning via Start-End Joint Embedding and In-Between Frame Prediction
- LoL: Longer than Longer, Scaling Video Generation to Hour
- DocSeeker: Structured Visual Reasoning with Evidence Grounding for Long Document Understanding
- SegGBC: Justifiable Coarse-to-Fine Granular-Ball Computing for Enhancing Clustering Image Segmentation
- MERIT: Multi-domain Efficient RAW Image Translation
- ReHyAt: Recurrent Hybrid Attention for Video Diffusion Transformers
- UAV-CB: A Complex-Background RGB–T Dataset and Local Frequency Bridge Network for UAV Detection
- The Drift Kernel: Why Diffusion Models Change Even When Told Not To
- NS-Diff: Fluid Navier–Stokes Guided Video Diffusion via Reinforcement Learning
- Emergent Extreme-View Geometry in 3D Foundation Models
- M4Human: A Large-Scale Multimodal mmWave Radar Benchmark for Human Mesh Reconstruction
- iLRM: An Iterative Large 3D Reconstruction Model
- FedBPrompt: Federated Domain Generalization Person Re-Identification via Body Distribution Aware Visual Prompts
- PvP: Data-Efficient Humanoid Robot Learning with Proprioceptive-Privileged Contrastive Representations
- Learning to Assist: Physics-Grounded Human-Human Control via Multi-Agent Reinforcement Learning
- Learning 3D Reconstruction with Priors in Test Time
- ThinkingViT: Matryoshka Thinking Vision Transformer for Elastic Inference
- Revisiting Model Stitching In the Foundation Model Era
- Unified Camera Positional Encoding for Controlled Video Generation
- When Understanding Becomes a Risk: Authenticity and Safety Risks in the Emerging Image Generation Paradigm
- Personalized Image Descriptions from Attention Sequences
- Denoising, Fast and Slow: Difficulty-Aware Adaptive Sampling for Image Generation
- Mantis: A Versatile Vision-Language-Action Model with Disentangled Visual Foresight
- UniChange: Unifying Change Detection with Multimodal Large Language Model
- PV-Ground: Text-Guided Point-Voxel Interaction for 3D Visual Grounding
- VLM-Loc: Localization in Point Cloud Maps via Vision-Language Models
- FlowDIS: Language-Guided Dichotomous Image Segmentation with Flow Matching
- ViBES: A Conversational Agent with Behaviorally-Intelligent 3D Virtual Body
- CrossHOI-Bench: A Unified Benchmark for HOI Evaluation across Vision-Language Models and HOI-Specific Methods
- CineBrain: A Large-Scale Multi-Modal Audiovisual Brain Dataset for Brain-Conditioned Video Generation
- SAQN: Semantic-based Adaptive Query Network for 3D Referring Expression Segmentation
- DetectSCI: Toward Object-Guided ROI Reconstruction for High-Resolution Video Snapshot Compressive Imaging
- DriverGaze360: OmniDirectional Driver Attention with Object-Level Guidance
- Ego-Grounding for Personalized Question-Answering in Egocentric Videos
- AnthroTAP: Learning Point Tracking with Real-World Motion
- Beyond [CLS] Token: Query-Driven Token-Level Forgery Purification for Generalizable Deepfake Detection
- VecAttention: Vector-wise Sparse Attention for Accelerating Long Context Inference
- A Temporal and Content Co-Awareness Latent Diffusion for Controllable Hand Image Generation
- FlashVSR: Towards Real-time Diffusion-Based Streaming Video Super Resolution
- GeoAgent: Learning to Geolocate Everywhere with Reinforced Geographic Characteristics
- RenderFlow: Single-Step Neural Rendering via Flow Matching
- WalkGPT: Grounded Vision–Language Conversation with Depth-Aware Segmentation for Pedestrian Navigation
- PARSE: Part-Aware Relational Spatial Modeling
- Smoothing the Score Function to Enhance Generalization in Diffusion Models
- Multi-Hierarchical Contrastive Spectral Fusion for Multi-View Clustering
- A Difference-in-Difference Approach to Detecting AI-Generated Images
- BriMA: Bridged Modality Adaptation for Multi-Modal Continual Action Quality Assessment
- Beyond the Global Scores: Fine-Grained Token Grounding as a Robust Detector of LVLM Hallucinations
- Seeing Motion Through Polarity for Event-based Action Recognition
- R4: Retrieval-Augmented Reasoning for Vision-Language Models in 4D Spatio-Temporal Space
- Why Not Hyperparameter-Friendly Optimisation? A Monotonic Adaptive Norm Rescaling Approach For Long-Tailed Recognition
- Investigating Self-Supervised Representations for Audio-Visual Deepfake Detection
- Multi-view Consistent 3D Gaussian Head Avatars 'without' Multi-view Generation
- Cov2Pose: Leveraging Spatial Covariance for Direct Manifold-aware 6-DoF Object Pose Estimation
- Let it Snow! Animating 3D Gaussian Scenes with Dynamic Weather Effects via Physics-Guided Score Distillation
- 3D sans 3D Scans: Scalable Pre-training from Video-Generated Point Clouds
- RFDM: Residual Flow Diffusion Models for Video Editing
- StreamReady: Learning What to Answer and When in Long Streaming Videos
- FiDeSR: High-Fidelity and Detail-Preserving One-Step Diffusion Super-Resolution
- MSGNav: Unleashing the Power of Multi-modal 3D Scene Graph for Zero-Shot Embodied Navigation
- Anti-I2V: Safeguarding your Photos from Malicious Image-to-video Generation
- Concept-Aware LoRA for Domain-Aligned Segmentation Dataset Generation
- TIACam: Text-Anchored Invariant Feature Learning with Auto-Augmentation for Camera-Robust Zero-Watermarking
- Scaling Test-Time Robustness of Vision-Language Models via Self-Critical Inference Framework
- ViKey: Enhancing Temporal Understanding in Videos via Visual Prompting
- The LLM Bottleneck: Why Open-Source Vision LLMs Struggle with Hierarchical Visual Recognition
- Learning Where to Look and How to Judge: Resolution-agnostic Image Quality Assessment with Quality-aware Saliency
- VITAL: Vision-Encoder-centered Pre-training for LMMs in Visual Quality Assessment
- AutoCut: End-to-end advertisement video editing based on multimodal discretization and controllable generation
- Inside-Out: Measuring Generalization in Vision Transformers Through Inner Workings
- Robust Promptable Video Object Segmentation
- Generative Modeling of Weights: Generalization or Memorization?
- See What I Mean: Aligning Vision and Language Representations for Video Fine-grained Object Understanding
- Controllable Federated Prompt Learning at Test Time
- MM-OVSeg: Multimodal Optical–SAR Fusion for Open-Vocabulary Segmentation in Remote Sensing
- MotionCrafter: Dense Geometry and Motion Reconstruction with a 4D VAE
- Progressive Mask Distillation for Self-supervised Video Representation
- CoWTracker: Tracking by Warping instead of Correlation
- DEVA: Fine-tuning Multimodal Large Language Models for Visual Perception Tasks
- EventHub: Data Factory for Generalizable Event-Based Stereo Networks without Active Sensors
- BiEvLight: Bi-level Learning of Task-Aware Event Refinement for Low-Light Image Enhancement
- Progressive Supernet Training for Efficient Visual Autoregressive Modeling
- Collaborative Multi-Mode Pruning for Vision-Language Models
- AdvFM: Lookahead Flow-Matching Velocity-Field Attacks for Imperceptible and Transferable Adversarial Examples
- DROID-SLAM in the Wild
- DPL: Decoupled Prototype Learning for Enhancing Robustness of Vision–Language Transformers to Missing Modalities
- From Panel to Pixel: Zoom-In Vision–Language Pretraining from Biomedical Scientific Literature
- PriVi: Towards a General-Purpose Video Model for Primate Behavior in the Wild
- SpiralDiff: Spiral Diffusion with LoRA for RGB-to-RAW Conversion Across Cameras
- Hint2Gen: Bridging Understanding and Generation via Code-structured Hints
- SIR: Structured Image Representations for Explainable Robot Learning
- SkillSight: Efficient First-Person Skill Assessment with Gaze
- VinQA: Visual Elements Interleaved Long-form Answer Generation for Real-World Multimodal Document QA
- Semantics Lead the Way: Harmonizing Semantic and Texture Modeling with Asynchronous Latent Diffusion
- FlowMotion: Training-Free Flow Guidance for Video Motion Transfer
- RDFace: A Benchmark Dataset for Rare Disease Facial Image Analysis under Extreme Data Scarcity and Phenotype-Aware Synthetic Generation
- Generative Neural Video Compression via Video Diffusion Prior
- GH-NAF: Grid-Adaptive Hash-Level–Attended Neural Attenuation Fields for Discrepancy-Aware CBCT
- FISHuman: Fine-grained Single-image 3D Human Reconstruction via Multi-view 4D Remeshing
- HAWK: Head Importance-Aware Visual Token Pruning in Multimodal Models
- Towards Storytelling Animations: Joint Synthesis of Human and Camera Motions
- Mining Instance-Centric Vision–Language Contexts for Human–Object Interaction Detection
- Measuring the (Un)Faithfulness of Concept-Based Explanations
- Scaling Parallel Sequence Models to Vision Foundation Models
- Phased DMD: Few-step Distribution Matching Distillation via Score Matching within Subintervals
- StreamVLO: Streaming Visual–LiDAR Odometry with Cumulative Drift Compensation
- Beyond Text Prompts: Precise Concept Erasure through Text–Image Collaboration
- From Intuition to Investigation: A Tool-Augmented Reasoning MLLM Framework for Generalizable Face Anti-Spoofing
- LangRef3DGS: Natural Language-Guided 3D Referential Segmentation from Partial Observations via 3D Gaussian Splatting
- It's Never Too Late: Noise Optimization for Collapse Recovery in Trained Diffusion Models
- Causal Motion Diffusion Models for Autoregressive Motion Generation
- Computer Vision with a Superpixelation Camera
- MatAnyone 2: Scaling Video Matting via a Learned Quality Evaluator
- Learnability-Guided Diffusion for Dataset Distillation
- The Invisible Gorilla Effect in Out-of-distribution Detection
- Nonlinear Color Transfer via Learnable Bezier Flows
- CAST: Context-Aware Dynamic Latent Space Transformation for Interactive Text-to-Image Retrieval
- Language Models Can Explain Visual Features via Steering
- Cinematic Audio Source Separation Using Visual Cues
- Beyond Static Frames: Temporal Aggregate-and-Restore Vision Transformer for Human Pose Estimation
- Failure Modes for Deep Learning–Based Online Mapping: How to Measure and Address Them
- LEMON: A Large Endoscopic MONocular Dataset and Foundation Model for Perception in Surgical Settings
- Scaling Instruction-Based Video Editing with a High-Quality Synthetic Dataset
- PETAR: Localized Findings Generation with Mask-Aware Vision-Language Modeling for PET Automated Reporting
- Scene-VLM: Multimodal Video Scene Segmentation via Vision-Language Models
- Towards Intrinsic-Aware Monocular 3D Object Detection
- DeX-Portrait: Disentangled and Expressive Portrait Animation via Explicit and Latent Motion Representations
- CoT-Edit: Let CoT Guide Instruction Video Editing
- Language-guided Frequency Modulation for Large Vision-Language Models
- TopoCL: Topological Contrastive Learning for Medical Imaging
- Conversational Image Segmentation: Grounding Abstract Concepts with Scalable Supervision
- Semantic Alignment for Pose-Invariant Identity Preserving Diffusion
- Exposing and Evaluating Hallucinations for GUI Grounding
- Image-based Outlier Synthesis With Training Data
- Kontinuous Kontext: Continuous Strength Control for Instruction-based Image Editing
- DiffusionHarmonizer: Bridging Neural Reconstruction and Photorealistic Simulation with Online Diffusion Enhancer
- Geometrically-Constrained Agent for Spatial Reasoning
- Tutor-Student Reinforcement Learning: A Dynamic Curriculum for Robust Deepfake Detection
- MatchED: Crisp Edge Detection Using End-to-End, Matching-based Supervision
- TempoControl: Temporal Attention Guidance for Text-to-Video Models
- Inter-Edit: First Benchmark for Interactive Instruction-Based Image Editing
- RECS4R: Bridging Semantics and Geometry for Referring Remote Sensing Interpretation
- TokenSplat: Token-aligned 3D Gaussian Splatting for Feed-forward Pose-free Reconstruction
- GeoSURGE: Geo-localization using Semantic Fusion with Hierarchy of Geographic Embeddings
- MultiCrafter: High-Fidelity Multi-Subject Generation via Disentangled Attention and Identity-Aware Preference Alignment
- Uncertainty-guided Compositional Alignment with Part-to-Whole Semantic Representativeness in Hyperbolic Vision-Language Models
- TokenTrace: Multi-Concept Attribution through Watermarked Token Recovery
- Human-like Abstract Visual Reasoning via Understanding and Solving Reasoning Loop
- HierEdit: Region-Aware Hierarchical Diffusion for Efficient High-Resolution Editing
- Neu-PiG: Neural Preconditioned Grids for Fast Dynamic Surface Reconstruction on Long Sequences
- Δynamics: Language-Based Representation for Inferring Rigid-Body Dynamics From Videos
- FedAdamom: Adaptive Momentum for Improved Generalization in Federated Optimization
- MoRGS: Efficient Per-Gaussian Motion Reasoning for Streamable Dynamic 3D Scenes
- LASER: Layer-wise Scale Alignment for Training-Free Streaming 4D Reconstruction
- EvoID: Reinforced Evolution for Identity-Preserving Video Generation
- Photo-Guided Tooth Segmentation on 3D Oral Scan Model
- Interpretable Debiasing of Vision-Language Models for Social Fairness
- Post-training Feature Pruning for Fundus Images Classification
- Tracking-Guided 4D Generation: Foundation-Tracker Motion Priors for 3D Model Animation
- LATA: Laplacian-Assisted Transductive Adaptation for Conformal Uncertainty in Medical VLMs
- Revisiting Visual Corruptions in LVLMs: A Shape–Texture Perspective on Model Failures
- MacTok: Robust Continuous Tokenization for Image Generation
- Anchor-Guided Gradient Alignment for Incomplete Multimodal Learning
- Efficient Unrolled Networks for Large-Scale 3D Inverse Problems
- Stability-Driven Motion Generation for Object-Guided Human-Human Co-Manipulation
- Hierarchical Visual Relocalization with Nearest View Synthesis from Feature Gaussian Splatting
- Cross-Domain Few-Shot Segmentation via Multi-view Progressive Adaptation
- Mocap-2-to-3: Multi-view Lifting for Monocular Motion Recovery with 2D Pretraining
- KαLOS finds Consensus: A Meta-Algorithm for Evaluating Inter-Annotator Agreement in Complex Vision Tasks
- Multi-view Pyramid Transformer: Look Coarser to See Broader
- Hyperbolic Gramian Volumes for Multimodal Alignment
- PhysVid: Physics Aware Local Conditioning for Generative Video Models
- Learning by Analogy: A Causal Framework for Compositional Generalization
- Visual Diffusion Models are Geometric Solvers
- VGent: Visual Grounding via Modular Design for Disentangling Reasoning and Prediction
- SpaceDrive: Infusing Spatial Awareness into VLM-based Autonomous Driving
- Remote Sensing Image Super-Resolution for Imbalanced Textures: A Texture-Aware Diffusion Framework
- Edit2Perceive: Image Editing Diffusion Models Are Strong Dense Perceivers
- HierAmp: Coarse-to-Fine Autoregressive Amplification for Generative Dataset Distillation
- More Than Meets the Eye: A Unified Image Fusion Framework via Semantic-Pixel Entropy Trade-off for Zero-Shot Generalization
- RL-ScanIQA: Reinforcement-Learned Scanpaths for Blind 360° Image Quality Assessment
- Pose-guided Enriched Feature Learning for Federated-by-camera Person Re-identification
- MU-GeNeRF: Multi-view Uncertainty-guided Generalizable Neural Radiance Fields for Distractor-aware Scene
- HFedATM: Hierarchical Federated Domain Generalization via Optimal Transport and Regularized Mean Aggregation
- Taming Video Models for 3D and 4D Generation via Zero-Shot Camera Control
- DualReg: Dual-Space Filtering and Reinforcement for Rigid Registration
- 2ndMatch: Finetuning Pruned Diffusion Models via Second-Order Jacobian Matching
- PyraTok: Language-Aligned Pyramidal Tokenizer for Video Understanding and Generation
- Test-Time Alignment of Text-to-Image Diffusion Models via Null-Text Embedding Optimisation
- Improving Sparse Autoencoder with Dynamic Attention
- When Lines Meet Textures: Spatial-Frequency Aligned Diffusion Features for Cross-Sparsity Correspondence
- Medic-AD: Towards Medical Vision-Language Model's Clinical Intelligence
- Unified Vector Floorplan Generation via Markup Representation
- LEAD: Minimizing Learner-Expert Asymmetry in End-to-End Driving
- Drainage: A Unifying Framework for Addressing Class Uncertainty
- MoLingo: Motion–Language Alignment for Text-to-Human Motion Generation
- Decoupling Vision and Language: Codebook Anchored Visual Adaptation
- OmniBrainBench: A Comprehensive Multimodal Benchmark for Brain Imaging Analysis Across Multi-stage Clinical Tasks
- Cross-modal Fuzzy Alignment Network for Text-Aerial Person Retrieval and A Large-scale Benchmark
- PropFly: Learning to Propagate via On-the-Fly Supervision from Pre-trained Video Diffusion Models
- PR-IQA: Partial-Reference Image Quality Assessment for Diffusion-Based Novel View Synthesis
- PLACID: Identity-Preserving Multi-Object Compositing via Video Diffusion with Synthetic Trajectories
- Abstract 3D Perception for Spatial Intelligence in Vision-Language Models
- SimRecon: SimReady Compositional Scene Reconstruction from Real Videos
- Expand and Prune: Maximizing Trajectory Diversity for Effective GRPO in Generative Models
- CoMo: Learning Continuous Latent Motion from Internet Videos for Scalable Robot Learning
- PFGNet: A Fully Convolutional Frequency-Guided Peripheral Gating Network for Efficient Spatiotemporal Predictive Learning
- BEA-GS: BEyond RAdiance Supervision in 3DGS for Precise Object Extraction
- Color-Encoded Illumination for High-Speed Volumetric Scene Reconstruction
- VideoCoF: Unified Video Editing with Temporal Reasoner
- Understanding and Mitigating Hallucinations in Multimodal Chain-of-Thought Models
- Beyond Single Images: A Comprehensive Benchmark for Album-Level Vision-Language Understanding
- CURE: Curriculum-guided Multi-task Training for Reliable Anatomy Grounded Report Generation
- Thinking with Video: Video Generation as a Promising Multimodal Reasoning Paradigm
- DIMOS: Disentangling Instance-level Moving Object Segmentation
- SFR-Net: Steering-Fusion-Refining Network in Multi-label Zero-Shot Sewer Defect Detection
- Differentiable Laplacian Matrix Guided Superpixel Segmentation
- Neural Differentiation in Deep Networks: A Theoretical Framework for Expressivity and Representational Diversity
- Recurrent Reasoning with Vision-Language Models for Estimating Long-Horizon Embodied Task Progress
- Minerva-Ego: Spatiotemporal Hints for Egocentric Video Understanding
- SpiderCam: Low-Power Snapshot Depth from Differential Defocus
- Machine Unlearning via Adaptive Gradient Reweighting and Multi-stage Objective Optimization
- ZOO-Prune: Training-Free Token Pruning via Zeroth-Order Gradient Estimation in Vision-Language Models
- AdaCluster: Adaptive Query-Key Clustering for Sparse Attention in Video Generation
- TaskForce: Cooperative Multi-agent Reinforcement Learning for Multi-task Optimization
- CME-CAD: Heterogeneous Collaborative Multi-Expert Reinforcement Learning for CAD Code Generation
- PromptLoop: Plug-and-Play Prompt Refinement via Latent Feedback for Diffusion Model Alignment
- FastRef: Fast Prototype Refinement for Few-shot Industrial Anomaly Detection
- Deformation-based In-Context Learning for Point Cloud Understanding
- MoRel: Long-Range Flicker-Free 4D Motion Modeling via Anchor Relay-based Bidirectional Blending with Hierarchical Densification
- MMR-AD: A Large-Scale Multimodal Dataset for Benchmarking General Anomaly Detection with Multimodal Large Language Models
- DuetMerging: Synergizing Dynamic and Static Strategies for Mitigating Task Interference in Model Merging
- Learning to Infer Parameterized Representations of Plants from 3D Scans
- Boundary-Responsive Differentiable Gating for Superpixel-Based Segmentation
- Making the Classification Explanation Faithful to the Confidence Score
- Specificity-aware reinforcement learning for fine-grained open-world classification
- UPLiFT: Efficient Pixel-Dense Feature Upsampling with Local Attenders
- ReBaPL: Repulsive Bayesian Prompt Learning
- Rethinking Two-Stage Referring-by-Tracking in Referring Multi-Object Tracking: Make it Strong Again
- CountGD++: Generalized Prompting for Open-World Counting
- Adaptive Auxiliary Prompt Blending for Target-Faithful Diffusion Generation
- Hyper-PCN: Hypergraph-Based Point Cloud Completion via High-Order Correlation Modeling
- SketchFaceGS: Real-Time Sketch-Driven Face Editing and Generation with Gaussian Splatting
- SimScale: Learning to Drive via Real-World Simulation at Scale
- FedRE: A Representation Entanglement Framework for Model-Heterogeneous Federated Learning
- TagSplat: Topology-Aware Gaussian Splatting for Dynamic Mesh Modeling and Tracking
- Sketch2CT: Multimodal Diffusion for Structure-Aware 3D Medical Volume Generation
- DeltaQuant: 4-bit Video Diffusion Models with Spatiotemporal Delta Smoothing
- StereoWorld: Geometry-Aware Monocular-to-Stereo Video Generation
- MusicInfuser: Making Video Diffusion Listen and Dance
- ShapeAR: Generating Editable Shape Layers via Autoregressive Diffusion
- DiverseDiT: Towards Diverse Representation Learning in Diffusion Transformers
- Learning Straight Flows: Variational Flow Matching for Efficient Generation
- Robo-SGG: Exploiting Layout-Oriented Normalization and Restitution Can Improve Robust Scene Graph Generation
- Solving a Nonlinear Blind Inverse Problem for Tagged MRI with Physics and Deep Generative Priors
- Mapping Networks
- Forecasting 3D Scanpaths in Egocentric Video
- Preserving Source Video Realism: High-Fidelity Face Swapping for Cinematic Quality
- MMSD3.0: A Multi-Image Benchmark for Real-World Multimodal Sarcasm Detection
- Probabilistic Prompt Adaptation for Unified Image Aesthetics and Quality Assessment
- SegQuant: A Semantics-Aware and Generalizable Quantization Framework for Diffusion Models
- TherA: Thermal-Aware Visual-Language Prompting for Controllable RGB-to-Thermal Infrared Translation
- NeoVerse: Enhancing 4D World Model with in-the-wild Monocular Videos
- SemLayer: Semantic-aware Generative Segmentation and Layer Construction for Abstract Icons
- Latent Chain-of-Thought World Modeling for End-to-End Driving
- GeodesicNVS: Probability Density Geodesic Flow Matching for Novel View Synthesis
- BluRef: Unsupervised Image Deblurring with Dense-Matching References
- Spatiotemporal Pyramid Flow Matching for Climate Emulation
- From 2D Alignment to 3D Plausibility: Unifying Heterogeneous 2D Priors and Penetration-Free Diffusion for Occlusion-Robust Two-Hand Reconstruction
- EditMGT: Unleashing Potentials of Masked Generative Transformers in Image Editing
- PS-SR: Pseudo-Single-Step Video Super-Resolution via Speculative Diffusion
- AGENTSAFE: Benchmarking the Safety of Embodied Agents on Hazardous Instructions
- When Pretty Isn’t Useful: Investigating Why Modern Text-to-Image Models Fail as Reliable Training Data Generators
- Masking Matters: Unlocking the Spatial Reasoning Capabilities of LLMs for 3D Scene-Language Understanding
- Learning Mutual View Information Graph for Adaptive Adversarial Collaborative Perception
- IFCSR: Inference-Free Fidelity-Realism Control for One-Step Diffusion-based Real-World Image Super-Resolution
- FairLLaVA: Fairness-Aware Parameter-Efficient Fine-Tuning for Large Vision-Language Assistants
- Adaptive Video Distillation: Mitigating Oversaturation and Temporal Collapse in Few-Step Generation
- OctoNav: Towards Generalist Embodied Navigation
- MV-Fashion: Towards Enabling Virtual Try-On and Size Estimation with Multi-View Paired Data
- MAD: Modality-Adaptive Decoding for Mitigating Cross-Modal Hallucinations in Multimodal Large Language Models
- RobustVisRAG: Causality-Aware Vision-Based Retrieval-Augmented Generation under Visual Degradations
- GOR-IS: 3D Gaussian Object Removal In the Intrinsic Space
- Learning from Oblivion: Predicting Knowledge-Overflowed Weights via Retrodiction of Forgetting
- EgoXtreme: A Dataset for Robust Object Pose Estimation in Egocentric Views under Extreme Conditions
- MANSION: Multi-floor lANguage-to-3D Scene generatIOn for loNg-horizon tasks
- Uni-Hema: Unified Model for Digital Hematopathology
- Seeing Conversations: Communication Context Identification in Egocentric Video
- Object-WIPER: Training-Free Object and Associated Effect Removal in Videos
- ActiveGrasp: Information-Guided Active Grasping with Calibrated Energy-based Model
- TaskIT: Memory-Efficient Fine-Tuning of Multi-LoRA LLMs via Cross-Task Importance Transfer
- SECOS: Semantic Capture for Rigorous Classification in Open-World Semi-Supervised Learning
- Diffusion MRI Transformer with a Diffusion Space Rotary Positional Embedding (D-RoPE)
- AeroGS: Scale-Aware Gaussian Splatting for Pose-Free Dynamic UAV Scene Reconstruction
- TeamHOI: Learning a Unified Policy for Cooperative Human-Object Interactions with Any Team Size
- GPFlow: Gaussian Prototype Probability Flow for Unsupervised Multi-Modal Anomaly Detection
- Learning Convex Decomposition via Feature Fields
- OpenVision 2: A Family of Generative Pretrained Visual Encoders for Multimodal Learning
- RLFTSim: Realistic and Controllable Multi-Agent Traffic Simulation via Reinforcement Learning Fine-Tuning
- FILTR: Extracting Topological Features from Pretrained 3D Models
- Good Can Sometimes be Bad: A Unified Attack against 3D Point Cloud Classifier by a Flexible Isotropic Resampling
- SALMUBench: A Benchmark for Sensitive Association-Level Multimodal Unlearning
- Virtual Nodes Guided Dynamic Graph Neural Network for Brain Tumor Segmentation with Missing Modalities
- RankOOD - Class Ranking-based Out-of-Distribution Detection
- EDGS: Eliminating Densification for Efficient Convergence of 3DGS
- RealVLG-R1: A Large-Scale Real-World Visual-Language Grounding Benchmark for Robotic Perception and Manipulation
- ReCALL: Recalibrating Capability Degradation for MLLM-based Composed Image Retrieval
- EMMA: Concept Erasure Benchmark with Comprehensive Semantic Metrics and Diverse Categories
- Learning Long-term Motion Embeddings for Efficient Kinematics Generation
- DSCA: Dynamic Subspace Concept Alignment for Lifelong VLM Editing
- End-to-End Language-Action Model for Humanoid Whole Body Control
- HeSS: Head Sensitivity Score for Sparsity Redistribution in VGGT
- Lenses: Toward Polysemous Vision–Language Understanding
- D2Cache: Second-Order Delta Caching for Higher Video Diffusion Acceleration
- Towards Calibrating Prompt Tuning of Vision-Language Models
- Diagnose, Correct, and Learn from Manipulation Failures via Visual Symbols
- Image-to-Point Cloud Feature Back-Projection for Multimodal Training of 3D Semantic Segmentation
- Masked Auto-Regressive Variational Acceleration: Fast Inference Makes Practical Reinforcement Learning
- CoIn3D: Revisiting Configuration-Invariant Multi-Camera 3D Object Detection
- Deciphering Genotype-Phenotype Mechanisms from High-Content Profiling via Knowledge-Guided Multi-modal Graph Learning
- MIBURI: Towards Expressive Interactive Gesture Synthesis
- IMU-HOI: A Symbiotic Framework for Coherent Human-Object Interaction and Motion Capture via Contact-Conscious Inertial Fusion
- OmniLottie: Generating Vector Animations via Parameterized Lottie Tokens
- Resolving the Identity Crisis in Text-to-Image Generation
- Revisiting Multimodal KV Cache Compression: A Frequency-Domain-Guided Outlier-KV-Aware Approach
- TrafficAlign: Aligning Large Language Models for Traffic Scenario Generation
- Cross-View Splatter: Feed-Forward View Synthesis with Georeferenced Images