TableMix: Enhancing Multimodal Table Reasoning in MLLMs from a Data-Centric Perspective
Abstract
Recent advances in Multimodal Large Language Models (MLLMs) have enabled promising progress in table reasoning from visual table inputs. Despite their ability to capture rich visual cues such as color and layout, MLLMs still underperform text-only models. We argue that a major limitation lies in the pre-training process, which inadvertently weakens the model's intrinsic reasoning ability and consequently hinders the effectiveness of reinforcement fine-tuning on table reasoning tasks. In this paper, we introduce TableMix, a novel framework that tackles this challenge from a data-centric perspective. At the core of TableMix is a principled data mixing strategy. Specifically, TableMix constructs a hybrid dataset that combines: (1) multimodal table reasoning data to improve task-specific reasoning, (2) text-only mathematical reasoning data to revive the model's logical competence, and (3) simple multimodal perception data to preserve visual grounding. Recognizing the non-uniform difficulty of the mixed data, we further propose a Difficulty-Aware Reward Shaping (DRS) mechanism, which enables the Group Relative Policy Optimization (GRPO) algorithm to adaptively reward concise reasoning on easy problems while encouraging more elaborate reasoning on complex ones, thereby reducing redundant computation and errors. Extensive experiments show that TableMix markedly enhances the reasoning ability of MLLMs, outperforming strong multimodal baselines and even rivaling state-of-the-art text-only models.
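To make the DRS idea concrete, the following is a minimal illustrative sketch of difficulty-aware reward shaping within a GRPO group. All function names, the difficulty estimate, and the specific length-penalty form are assumptions for illustration only, not the paper's actual formulation: we assume difficulty is approximated per group (e.g. from group accuracy) and that a length penalty is scaled down as difficulty rises, so concise answers are favored on easy problems and longer reasoning is tolerated on hard ones.

```python
import numpy as np

def drs_rewards(correct, lengths, group_difficulty, penalty_weight=0.1):
    """Illustrative difficulty-aware reward shaping for one GRPO group.

    correct          : 0/1 correctness per sampled response in the group
    lengths          : response lengths in tokens
    group_difficulty : assumed difficulty estimate in [0, 1]
                       (hypothetical; e.g. 1 - group accuracy)
    Returns shaped rewards and group-standardized advantages.
    """
    correct = np.asarray(correct, dtype=float)
    lengths = np.asarray(lengths, dtype=float)
    norm_len = lengths / lengths.max()

    # Easy problems (low difficulty): penalize long responses, rewarding concision.
    # Hard problems (high difficulty): shrink the penalty, tolerating longer reasoning.
    length_penalty = (1.0 - group_difficulty) * norm_len
    rewards = correct - penalty_weight * length_penalty

    # GRPO-style advantage: standardize rewards within the group.
    adv = (rewards - rewards.mean()) / (rewards.std() + 1e-6)
    return rewards, adv
```

Under this sketch, on an easy group a short correct answer outscores a long correct one, while on a maximally hard group the penalty vanishes and only correctness matters.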