Twin-T & TwintVQA: A Reliable Structure–Detail Separating VLM and a Comprehensive Benchmark for Chart and Table Tasks
Jiahua Bao ⋅ Siyao Cheng ⋅ Jiaxing Du ⋅ Qingtao Xia ⋅ Changjiang He ⋅ Zeming Lang ⋅ Jie Liu
Abstract
With the rapid development of Vision-Language Models (VLMs), there is growing demand for automatic analysis of structured visual data. Charts and tables are primary carriers of quantitative information, with regular layouts and explicit numbers. However, current general-purpose VLMs and expert models make limited use of these chart-table features during training and inference. Another challenge is cross-format conversion in realistic settings: chart and table outputs span Python and LaTeX, and most VLMs struggle to handle this breadth reliably. These gaps often lead to analysis mistakes and unreliable generated text. To overcome these limitations, we propose $\underline{\texttt{\textbf{Twin-T}}}$, a two-stage expert VLM for comprehensive char$\underline{\texttt{\textbf{t}}}$-$\underline{\texttt{\textbf{t}}}$able tasks across Image, LaTeX, and Python. In stage 1, we propose a novel dual-head image encoder that separates structural cues and fine details from input images. In stage 2, we propose MINT, a preference learning method that emphasizes number and keyword fidelity as well as vision–text matching. Furthermore, we introduce TwintVQA, a comprehensive benchmark spanning 17 chart types, 11 task types, 3 data formats, and short/medium/long QA settings. Our model narrows the gap between open-source and closed-source models on mainstream chart–table benchmarks, outperforming open-source models, including GLM-4.5V-106B, while remaining competitive with GPT-4o and Gemini-2.5-Pro. Our code and additional details are available in the Appendix.
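The abstract describes the stage-1 dual-head encoder only at a high level. As one possible reading, the sketch below shows a minimal PyTorch module with a shared patch backbone feeding two heads: a structure head that pools tokens into a short layout code, and a detail head that keeps full-resolution patch tokens for numerals and fine marks. All names and hyperparameters here (`DualHeadEncoder`, the 32-token layout code, the 4-layer backbone) are hypothetical illustrations, not the paper's actual architecture.

```python
import torch
import torch.nn as nn


class DualHeadEncoder(nn.Module):
    """Hypothetical sketch of a structure/detail-separating image encoder:
    a shared patch backbone followed by a structure head (coarse, pooled
    tokens for layout cues such as axes and grids) and a detail head
    (full-resolution tokens for numbers and tick labels)."""

    def __init__(self, dim: int = 768, patch: int = 16, n_struct_tokens: int = 32):
        super().__init__()
        # Shared patch embedding over the chart/table image.
        self.patch_embed = nn.Conv2d(3, dim, kernel_size=patch, stride=patch)
        enc_layer = nn.TransformerEncoderLayer(d_model=dim, nhead=8, batch_first=True)
        self.backbone = nn.TransformerEncoder(enc_layer, num_layers=4)
        # Structure head: pools the token sequence down to a short layout code.
        self.struct_pool = nn.AdaptiveAvgPool1d(n_struct_tokens)
        self.struct_proj = nn.Linear(dim, dim)
        # Detail head: keeps every patch token at full resolution.
        self.detail_proj = nn.Linear(dim, dim)

    def forward(self, images: torch.Tensor):
        # images: (B, 3, H, W) -> patch tokens: (B, N, dim)
        tokens = self.patch_embed(images).flatten(2).transpose(1, 2)
        tokens = self.backbone(tokens)
        # Structure stream: (B, n_struct_tokens, dim)
        structure = self.struct_proj(
            self.struct_pool(tokens.transpose(1, 2)).transpose(1, 2)
        )
        # Detail stream: (B, N, dim)
        detail = self.detail_proj(tokens)
        return structure, detail


if __name__ == "__main__":
    enc = DualHeadEncoder()
    s, d = enc(torch.randn(2, 3, 224, 224))
    print(s.shape, d.shape)  # (2, 32, 768) and (2, 196, 768)
```

The two output streams could then be fed to the language model separately, which is one way an encoder might expose layout and fine-grained numeric cues as distinct signals.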