Revisiting Visual Corruptions in LVLMs: A Shape–Texture Perspective on Model Failures
Abstract
Large vision–language models (LVLMs) are highly vulnerable to visual corruptions, a weakness that substantially compromises their reliability and limits real-world deployment. Prior work has attributed this degradation primarily to insufficient visual grounding and overreliance on language priors. However, these explanations often overlook the heterogeneous nature of corruptions, which perturb model perception in fundamentally different ways. We revisit this problem from a corruption-centric perspective and show that diverse corruptions can be organized along two complementary perceptual dimensions—shape and texture—which induce distinct failure modes. To address these failure modes, we propose Shape–Texture Dual-Path Contrastive Decoding (ST-CD), a training-free inference framework that constructs complementary contrastive views to diagnose and correct shape- and texture-induced biases through adaptive fusion. Experiments across multiple LVLMs and robustness benchmarks demonstrate that ST-CD consistently improves robustness under heterogeneous corruptions, suggesting that leveraging the complementarity between shape and texture provides a general and effective principle for building robust multimodal models.
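The abstract does not specify how the two contrastive paths are combined. The sketch below illustrates one plausible form of a dual-path contrastive decoding step, assuming the model exposes next-token logits for a clean view and for shape- and texture-perturbed views; the function names (`st_cd`, `contrastive_logits`), the contrast strength `alpha`, and the KL-based fusion weights are all hypothetical choices for illustration, not the authors' actual method.

```python
import numpy as np

def softmax(z):
    # numerically stable softmax over the last axis
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def contrastive_logits(base, distorted, alpha=1.0):
    # standard contrastive-decoding form: amplify what the clean view
    # predicts relative to the corrupted view
    return (1.0 + alpha) * base - alpha * distorted

def st_cd(base_logits, shape_logits, texture_logits, alpha=1.0):
    """Hypothetical dual-path fusion: contrast the clean logits against
    each corrupted view, then weight the two paths adaptively."""
    c_shape = contrastive_logits(base_logits, shape_logits, alpha)
    c_texture = contrastive_logits(base_logits, texture_logits, alpha)

    # adaptive fusion (one possible choice): weight each path by how much
    # its corrupted view diverges from the clean prediction (KL divergence)
    p = softmax(base_logits)
    kl_shape = float(np.sum(p * (np.log(p + 1e-12)
                                 - np.log(softmax(shape_logits) + 1e-12))))
    kl_texture = float(np.sum(p * (np.log(p + 1e-12)
                                   - np.log(softmax(texture_logits) + 1e-12))))
    w = softmax(np.array([kl_shape, kl_texture]))
    return w[0] * c_shape + w[1] * c_texture
```

Under this reading, a path whose corrupted view changes the prediction more (larger divergence from the clean view) contributes more to the fused logits, on the intuition that it has exposed a stronger corruption-induced bias to correct. When neither corruption perturbs the prediction, the fused logits reduce to the clean logits.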