CoIn: Coverage and Informativeness-Guided Token Reduction for Efficient Large Multimodal Models
Abstract
Large Multimodal Models (LMMs) have shown remarkable success in visual understanding tasks. LMMs encode visual and textual inputs into tokens, which are then processed by Large Language Models (LLMs). However, the large number of visual tokens poses a major bottleneck for inference efficiency and memory usage. Reducing visual tokens is a promising training-free solution, but existing methods remain limited. Importance-based approaches suffer from poor generalization, are incompatible with kernel-level inference optimizations, and consider information from only a single modality. Diversity-based strategies typically focus on pairwise token redundancy and treat all tokens as equally important. Recent attempts to sequentially combine importance and diversity criteria still fail to address the intrinsic drawbacks of their underlying metrics. To address these limitations, we propose CoIn, which reformulates visual token reduction as an optimal subset selection problem jointly guided by two complementary objectives: informativeness and coverage. Informativeness is quantified through per-token intrinsic saliency and visual–textual alignment, while coverage is enforced via a volume-based subset selection criterion that ensures global representativeness in the visual feature space. This joint formulation integrates visual saliency, cross-modal alignment, and global coverage into an end-to-end token selection process, yielding a computationally efficient, model-agnostic framework compatible with modern inference accelerators. Extensive experiments demonstrate that CoIn substantially reduces computation and memory cost while maintaining strong task performance. Our code will be released upon acceptance.
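To make the joint formulation concrete, the sketch below shows one plausible instantiation under our own assumptions (it is not the paper's released implementation): informativeness is scored per token as a blend of intrinsic saliency and visual–textual alignment, and coverage is enforced by greedily maximizing the log-determinant (volume) of a quality-weighted similarity kernel. All names here (`informativeness`, `select_tokens`, the blending weight `alpha`) are illustrative.

```python
import numpy as np

def informativeness(vis_feats, txt_feats, alpha=0.5):
    """Illustrative per-token informativeness: a blend of intrinsic
    saliency (here, feature norm) and visual-textual alignment (here,
    max cosine similarity to any text token), each min-max normalized.
    The exact scoring in CoIn may differ; this is an assumption."""
    saliency = np.linalg.norm(vis_feats, axis=1)
    v = vis_feats / (np.linalg.norm(vis_feats, axis=1, keepdims=True) + 1e-8)
    t = txt_feats / (np.linalg.norm(txt_feats, axis=1, keepdims=True) + 1e-8)
    align = (v @ t.T).max(axis=1)
    norm = lambda x: (x - x.min()) / (x.max() - x.min() + 1e-8)
    return alpha * norm(saliency) + (1 - alpha) * norm(align)

def select_tokens(vis_feats, txt_feats, k, alpha=0.5):
    """Greedy volume-based subset selection: maximize log det of the
    quality-weighted kernel L = diag(q) S diag(q), so that well-spread
    (high-coverage) subsets of informative tokens score highest."""
    n = vis_feats.shape[0]
    q = informativeness(vis_feats, txt_feats, alpha)
    v = vis_feats / (np.linalg.norm(vis_feats, axis=1, keepdims=True) + 1e-8)
    S = v @ v.T                                  # pairwise cosine similarity
    L = q[:, None] * S * q[None, :] + 1e-6 * np.eye(n)  # jitter for stability
    selected, remaining = [], list(range(n))
    for _ in range(k):
        best_j, best_logdet = None, -np.inf
        for j in remaining:
            idx = selected + [j]
            _, logdet = np.linalg.slogdet(L[np.ix_(idx, idx)])
            if logdet > best_logdet:
                best_j, best_logdet = j, logdet
        selected.append(best_j)
        remaining.remove(best_j)
    return selected
```

For instance, with 576 visual tokens from a ViT-style encoder, `select_tokens(vis, txt, k=64)` would retain 64 tokens. The naive greedy loop above re-evaluates a log-determinant per candidate and is meant only to convey the objective; an efficient implementation would use incremental (e.g., Cholesky-based) updates.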