

Poster

SynTab-LLaVA: Enhancing Multimodal Table Understanding with Decoupled Synthesis

Bangbang Zhou · Zuan Gao · Zixiao Wang · Boqiang Zhang · Yuxin Wang · Zhineng Chen · Hongtao Xie


Abstract:

Due to the limited scale of multimodal table understanding (MTU) data, model performance is constrained. A straightforward remedy is to use multimodal large language models to obtain more samples, but this may cause hallucinations, produce incorrect sample pairs, and incur significant cost. To address these issues, we design a simple yet effective synthesis framework that consists of two independent steps: table image rendering and table question-and-answer (Q&A) pair generation. We use table code (HTML, LaTeX, Markdown) to synthesize images and generate Q&A pairs with a large language model (LLM). This approach leverages the high concurrency and low cost of LLMs to boost annotation efficiency and reduce expenses. Because the LLM receives code rather than images, it can directly access the content and structure of the table, reducing hallucinations in table understanding and improving the accuracy of the generated Q&A pairs. Finally, we synthesize a large-scale MTU dataset, SynTab, containing 636K images and 1.8M samples at a total cost of under US$200. We further introduce a generalist tabular multimodal model, SynTab-LLaVA. This model not only effectively extracts local textual content within the table but also models global relationships between cells. SynTab-LLaVA achieves SOTA performance on 21 out of 24 in-domain and out-of-domain benchmarks, demonstrating the effectiveness and generalization of our method. Part of the data is provided in the supplementary materials and will be open-sourced later.
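To make the decoupled two-step idea concrete, below is a minimal sketch: step 1 renders a table's source code (here HTML) to an image with no LLM involved, and step 2 feeds the same source code, not the rendered image, to an LLM to generate Q&A pairs. The library choices (imgkit and the OpenAI client), the model name, and the prompt wording are illustrative assumptions, not the authors' actual pipeline.

```python
# Sketch of decoupled synthesis: (1) render table code to an image,
# (2) generate Q&A pairs from the same table code via an LLM.
# Assumptions: imgkit (needs wkhtmltoimage installed), the OpenAI Python
# client with OPENAI_API_KEY set, and a placeholder model name.
import imgkit
from openai import OpenAI

TABLE_HTML = """
<table border="1">
  <tr><th>Country</th><th>GDP (2023, $T)</th></tr>
  <tr><td>USA</td><td>27.4</td></tr>
  <tr><td>China</td><td>17.8</td></tr>
</table>
"""


def render_table_image(html: str, out_path: str) -> None:
    """Step 1: table image rendering -- purely deterministic, no LLM."""
    imgkit.from_string(html, out_path)


def generate_qa_pairs(html: str, n_pairs: int = 3) -> str:
    """Step 2: Q&A generation from the table *code*, so the LLM sees the
    exact cell contents and structure instead of a rendered image."""
    client = OpenAI()
    prompt = (
        f"Here is a table in HTML:\n{html}\n"
        f"Write {n_pairs} question-answer pairs about this table, "
        "one pair per line in the form 'Q: ... | A: ...'."
    )
    resp = client.chat.completions.create(
        model="gpt-4o-mini",  # placeholder; any capable text LLM works
        messages=[{"role": "user", "content": prompt}],
    )
    return resp.choices[0].message.content


if __name__ == "__main__":
    render_table_image(TABLE_HTML, "table_0001.png")
    print(generate_qa_pairs(TABLE_HTML))
```

Because the two steps share only the table code, the rendering and the Q&A generation can be scaled independently, which is what keeps the annotation cheap and highly concurrent.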
