MODIX: A Training-Free Multimodal Information-Driven Positional Index Scaling for Vision-Language Models
Abstract
Vision-Language Models (VLMs) have demonstrated remarkable capabilities in multimodal understanding, yet their positional encoding mechanisms remain fundamentally limited. Current approaches assign positional indices with a uniform stride across all tokens, failing to account for the substantial variation in information density across and within modalities. This uniform treatment leads to suboptimal attention allocation and inefficient cross-modal fusion. We introduce MODIX (Multimodal Information-Driven Positional Index Scaling), a training-free framework that dynamically adapts positional granularity based on an information-theoretic analysis of modality contributions. By jointly quantifying the intrinsic information density within each modality and the strength of cross-modal interactions, MODIX assigns finer positional strides to information-rich content and coarser strides to redundant regions. Operating purely at inference time, our method requires no architectural modifications or retraining, enabling plug-and-play integration with existing VLMs. Comprehensive experiments across multiple state-of-the-art architectures and six benchmarks show that MODIX consistently improves multimodal reasoning, achieving gains of up to 8.4% on ScienceQA and 6.8% on RealWorldQA while dynamically adapting positional resolution to task-specific information distributions.
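To make the index-scaling idea concrete, the sketch below illustrates one way such a mapping from information scores to positional strides could be realized, assuming normalized per-token information scores are already available. The function name `compute_scaled_positions` and the `min_stride`/`max_stride` parameters are illustrative assumptions for exposition, not part of the MODIX specification.

```python
import torch

def compute_scaled_positions(info_scores: torch.Tensor,
                             min_stride: float = 0.5,
                             max_stride: float = 1.0) -> torch.Tensor:
    """Map per-token information scores to scaled positional indices.

    High-information tokens receive finer (smaller) position strides and
    redundant tokens receive coarser (larger) strides, so positional
    resolution tracks information density.
    """
    # Normalize scores to [0, 1]; the epsilon guards against a constant vector.
    lo, hi = info_scores.min(), info_scores.max()
    norm = (info_scores - lo) / (hi - lo + 1e-8)

    # High information -> stride close to min_stride (fine granularity);
    # low information  -> stride close to max_stride (coarse granularity).
    strides = max_stride - (max_stride - min_stride) * norm

    # Position of token i is the sum of the strides of all preceding tokens;
    # these fractional indices would then feed the model's positional layer (e.g., RoPE).
    positions = torch.cumsum(strides, dim=0) - strides
    return positions

# Hypothetical usage: five tokens with varying information scores.
scores = torch.tensor([0.90, 0.10, 0.12, 0.85, 0.05])
print(compute_scaled_positions(scores))
# Informative tokens (0.90, 0.85) advance the index by ~0.5; redundant ones by ~1.0.
```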