Phrase-Grounding-Aware Supervised Fine-Tuning for Chart Recognition via Side-Masked Attention
Abstract
Recent advances in chart recognition have been driven by supervised fine-tuning (SFT) of vision-language models (VLMs), which unifies multiple related tasks, and by diversifying training corpora. In parallel, research on leveraging large language models (LLMs) for object detection has shown that jointly training phrase grounding alongside SFT enhances a model's generative capabilities. Inspired by this, we hypothesize that chart recognition can also benefit from phrase grounding, which aligns textual phrases with chart regions, a setting that remains underexplored due to the lack of corresponding datasets. In this work, we introduce phrase-grounding-aware SFT via a Side-Masked Attention Module (SMAM), which is inserted into each transformer layer of the LLM. SMAM performs masked attention restricted to the annotated region aligned with the corresponding phrase, producing an additional logit. During fine-tuning, we supervise this logit and use it as a reference to guide the LLM's output prediction, alongside the standard SFT objective. To enable this approach, we also develop an automated pipeline that generates phrase-to-region alignments, augmenting existing datasets. Experiments show that our method effectively incorporates phrase grounding into chart recognition via VLM fine-tuning. Code and datasets will be released upon acceptance.
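To make the side-masked attention idea concrete, the sketch below shows one plausible reading of such a module in PyTorch: attention over the layer's hidden states is restricted to tokens inside the grounded region, and the result is projected to an auxiliary vocabulary logit that can be supervised alongside the SFT loss. The class name, mask convention, and loss weighting are assumptions for illustration, not the paper's actual implementation.

```python
# Hypothetical sketch of a side-masked attention module; the abstract does not
# specify SMAM's internals, so names and shapes here are assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F


class SideMaskedAttention(nn.Module):
    def __init__(self, d_model: int, n_heads: int, vocab_size: int):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.to_logits = nn.Linear(d_model, vocab_size)  # auxiliary logit head

    def forward(self, hidden: torch.Tensor, region_mask: torch.Tensor) -> torch.Tensor:
        # hidden:      (batch, seq, d_model) hidden states from the host layer
        # region_mask: (batch, seq) bool, True for tokens inside the annotated
        #              region aligned with the phrase.
        # MultiheadAttention's key_padding_mask uses True = "ignore", so invert
        # the region mask to attend only within the region.
        key_padding_mask = ~region_mask
        out, _ = self.attn(hidden, hidden, hidden,
                           key_padding_mask=key_padding_mask)
        return self.to_logits(out)  # (batch, seq, vocab) auxiliary logits


# Toy usage: supervise the auxiliary logits with next-token targets, as a
# companion to the standard SFT objective (targets and weighting are assumed).
if __name__ == "__main__":
    B, S, D, H, V = 2, 16, 64, 4, 100
    smam = SideMaskedAttention(D, H, V)
    hidden = torch.randn(B, S, D)
    region_mask = torch.zeros(B, S, dtype=torch.bool)
    region_mask[:, 4:10] = True  # tokens aligned with the grounded phrase
    aux_logits = smam(hidden, region_mask)
    targets = torch.randint(0, V, (B, S))
    aux_loss = F.cross_entropy(aux_logits.reshape(-1, V), targets.reshape(-1))
    print(aux_logits.shape, aux_loss.item())
```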