Seeing What Matters: A Training-Free Self-Guided Framework for Multimodal Detail Perception and Reasoning
Abstract
Multimodal large language models (MLLMs) have achieved remarkable success on diverse vision-language tasks. However, fixed-resolution models struggle to perceive fine-grained visual details, largely due to distracted attention and blurry vision. To address these issues, we propose SLoFo, a training-free, self-guided inference framework that mimics the human "Scan-Locate-Focus" process. SLoFo first adopts a dual-branch mechanism to identify critical image regions: the Semantic branch constructs a gradient-based semantic relevance map, while the Structure branch estimates visual token uniqueness, providing complementary and robust evidence. By combining both branches, SLoFo perceives and explicitly crops critical regions. During inference, with the additional cropped sub-image as input, SLoFo applies a progressive visual token pruning strategy that sharpens attention on key areas while reducing computational overhead. Experiments on detail-sensitive and general-purpose benchmarks show that SLoFo consistently improves accuracy (+4.79% on TextVQA, +2.62% on GQA) and robustness (+4.60% on POPE-MSCOCO adversarial) without any training or external modules.
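To make the "Scan-Locate" idea sketched above concrete, the snippet below is a minimal, hedged illustration (not the paper's actual implementation): it assumes the two branches have already produced per-token score grids, fuses them with equal weights after min-max normalization, and crops the bounding box of the highest-scoring cells. The function names, the 50/50 fusion, and the top-40% threshold are illustrative assumptions.

```python
import numpy as np

def combine_branch_maps(semantic_map: np.ndarray, structure_map: np.ndarray) -> np.ndarray:
    """Fuse the two branch scores into one per-token importance map.

    Both inputs are (h, w) grids over visual token positions; each is
    min-max normalized before averaging so neither branch dominates.
    """
    def normalize(m: np.ndarray) -> np.ndarray:
        return (m - m.min()) / (m.max() - m.min() + 1e-8)
    return 0.5 * normalize(semantic_map) + 0.5 * normalize(structure_map)

def crop_critical_region(image: np.ndarray, importance: np.ndarray, keep_ratio: float = 0.4) -> np.ndarray:
    """Crop the sub-image covered by the most important token cells.

    `importance` is an (h, w) map over token positions; the crop is the
    bounding box of the top `keep_ratio` fraction of cells, scaled back
    to pixel coordinates.
    """
    h, w = importance.shape
    H, W = image.shape[:2]
    threshold = np.quantile(importance, 1.0 - keep_ratio)
    ys, xs = np.where(importance >= threshold)
    # Bounding box in token coordinates, then scaled to pixel coordinates.
    y0, y1 = ys.min() * H // h, (ys.max() + 1) * H // h
    x0, x1 = xs.min() * W // w, (xs.max() + 1) * W // w
    return image[y0:y1, x0:x1]

# Toy usage: a 24x24 token grid over a 336x336 image, with random maps
# standing in for the gradient-based relevance and token-uniqueness scores.
rng = np.random.default_rng(0)
image = rng.integers(0, 255, size=(336, 336, 3), dtype=np.uint8)
semantic = rng.random((24, 24))
structure = rng.random((24, 24))
importance = combine_branch_maps(semantic, structure)
sub_image = crop_critical_region(image, importance)
print(sub_image.shape)
```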