Poster

Non-Natural Image Understanding with Advancing Frequency-based Vision Encoders

w l · Qingsong Wang · Yueying Feng · Shulei Wang · Tao Jin · Zhou Zhao · Fei Wu · Chang Yao · Jingyuan Chen


Abstract:

Large language models (LLMs) have significantly enhanced cross-modal understanding capabilities by integrating visual encoders with textual embeddings, giving rise to multimodal large language models (MLLMs). However, these models struggle with non-natural images such as geometric figures and charts, particularly in fields like education and finance. Despite efforts to collect datasets and fine-tune MLLMs, the gap relative to natural-image understanding remains evident, and the cost of collecting large and diverse non-natural image datasets is high. To address this, we analyze the limitations of the transformer-based vision encoders (ViTs) within existing MLLMs from a frequency perspective. Studies have shown that ViT models are less effective at capturing high-frequency information, impairing their ability to capture elements like points, lines, and angles in non-natural images. In response, we introduce FM-ViT, a frequency-modulated vision encoder that uses Fourier decomposition to separate the high- and low-frequency components of self-attention features and re-weights them while tuning on non-natural images. In addition, we combine the features of CNN models with FM-ViT and propose EDGE, an MLLM with enhanced graphical encoders tailored to understanding non-natural images. Extensive experiments confirm the effectiveness of FM-ViT and EDGE on 4 types of comprehension tasks (classification, retrieval, captioning, and question answering) over 3 types of non-natural images (geometric, chart, and functional).
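To make the frequency-modulation idea concrete, here is a minimal numpy sketch of splitting a feature map into low- and high-frequency bands via the 2-D Fourier transform and re-weighting each band. This is an illustrative assumption, not the authors' implementation: the function name, the radial cutoff, and the scalar band weights `alpha_low`/`alpha_high` are all hypothetical (in FM-ViT the weights are learned during tuning, and the decomposition is applied to self-attention features).

```python
import numpy as np

def frequency_modulate(feat, alpha_low=1.0, alpha_high=1.5, cutoff=0.25):
    """Hypothetical sketch of frequency-based re-weighting.

    feat: (H, W, C) spatial feature map (e.g. reshaped ViT tokens).
    Splits the spectrum into a low-frequency disk and its complement
    using a radial cutoff, scales each band, and reconstructs the map.
    """
    H, W, _ = feat.shape
    # 2-D FFT over the spatial axes, shifted so DC sits at the center
    spec = np.fft.fftshift(np.fft.fft2(feat, axes=(0, 1)), axes=(0, 1))
    # radial frequency grid; True inside the low-frequency disk
    fy = np.fft.fftshift(np.fft.fftfreq(H))[:, None]
    fx = np.fft.fftshift(np.fft.fftfreq(W))[None, :]
    low_mask = (np.sqrt(fy**2 + fx**2) <= cutoff)[..., None]
    # re-weight the two bands (learned scalars in the paper's setting)
    spec_mod = spec * np.where(low_mask, alpha_low, alpha_high)
    # invert the shift and the transform; features are real-valued
    out = np.fft.ifft2(np.fft.ifftshift(spec_mod, axes=(0, 1)), axes=(0, 1))
    return out.real
```

With `alpha_high > alpha_low`, the output amplifies edge-like structure (points, lines, angles), which is the component the abstract argues ViTs under-represent on non-natural images; setting both weights to 1 recovers the input exactly.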
