Language-guided Frequency Modulation for Large Vision-Language Models
Abstract
Large Vision-Language Models (LVLMs) have demonstrated remarkable capabilities in visual reasoning across diverse tasks. These tasks place different demands on visual representations: some prioritize high-level global context, while others emphasize fine-grained local details. However, most existing methods operate on visual representations primarily in the spatial domain, lacking an explicit mechanism for distinguishing high-frequency local details from low-frequency global context. This limitation hinders fine-grained control of visual representations and complicates their hierarchical alignment with language. To address this issue, we introduce Language-guided Frequency Modulation (LFM), a plug-and-play approach that adaptively refines visual signals in the frequency domain under linguistic guidance. By selectively enhancing critical regions and details, LFM enables more structured and precise visual processing. Crucially, LFM introduces no trainable parameters beyond a lightweight learnable projector that refines visual tokens before they are fed into the LLM, keeping computational overhead minimal. Extensive experiments across diverse vision-language benchmarks highlight LFM’s scalability, effectiveness, and broad applicability to LVLMs. The code will be publicly available.
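To make the core idea concrete, the following is a minimal sketch of language-guided frequency modulation: visual tokens are transformed along the token axis with an FFT, split into low- and high-frequency bands, and each band is reweighted by gates derived from a pooled text embedding. All names (`w_gate`, `cutoff_ratio`) and the specific gating scheme are illustrative assumptions, not the paper's implementation; the actual LFM projector and band design may differ.

```python
import numpy as np

def language_guided_frequency_modulation(visual_tokens, text_embedding, w_gate,
                                         cutoff_ratio=0.25):
    """Hypothetical sketch of LFM-style band reweighting.

    visual_tokens:  (N, D) array of visual token features
    text_embedding: (D,) pooled language feature
    w_gate:         (D, 2) projection producing two band gates
                    (stands in for the "lightweight learnable projector")
    """
    N, _ = visual_tokens.shape
    # Token-axis spectrum: low frequencies capture global context,
    # high frequencies capture fine-grained local variation.
    spec = np.fft.fft(visual_tokens, axis=0)                  # (N, D) complex
    freqs = np.fft.fftfreq(N)                                 # normalized freqs
    low_mask = (np.abs(freqs) <= cutoff_ratio)[:, None]       # (N, 1) band selector
    # Language-conditioned gates in (0, 2): sigmoid scaled by 2 so each
    # band can be amplified (>1) or suppressed (<1) by the text.
    gates = 2.0 / (1.0 + np.exp(-(text_embedding @ w_gate)))  # shape (2,)
    g_low, g_high = gates
    modulated = spec * np.where(low_mask, g_low, g_high)
    # Inverse FFT returns refined tokens in the spatial domain.
    return np.fft.ifft(modulated, axis=0).real                # (N, D)
```

With zero gating weights the gates evaluate to 1 and the tokens pass through unchanged, which makes the module safe to insert into a pretrained pipeline before fine-tuning.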