Exploring Visual Pretraining for Learning Language Intelligence
Abstract
While the dominant pretraining paradigm trains modality-specific models on their respective datasets, the Platonic Representation Hypothesis, which posits that representations across modalities eventually align as data and models scale, suggests an intriguing possibility: large language models (LLMs) could be pretrained on visual corpora to reach parity with text-pretrained models, thereby expanding available data sources to break the text-scaling bottleneck and leveraging richer visual cues for more comprehensive corpus understanding. This paper makes the first attempt to demonstrate the feasibility of this implication by introducing Masked Autoregressive Pretraining for Learning language intelligencE (MAPLE), a novel visual pretraining paradigm that leverages raw document images to improve the language intelligence of LLMs. MAPLE universally integrates masked autoregressive models with various LLM backbones, incentivizing the LLMs to generate latent hypotheses for masked regions conditioned on the unmasked regions. We verify MAPLE in the domain of math reasoning with multiple LLM backbones and show that it consistently surpasses text-only pretraining, with relative gains of up to 40.2\% in average accuracy across four math reasoning benchmarks. Further analyses show that visually pretrained LLMs learn a shared latent space that aligns document visuals with text and exploits layout and structural cues, supporting visual pretraining as a feasible and scalable route to stronger language models.