

TRINS: Towards Multimodal Language Models that Can Read

Ruiyi Zhang · Yanzhe Zhang · Jian Chen · Yufan Zhou · Jiuxiang Gu · Changyou Chen · Tong Sun

Arch 4A-E Poster #294
Fri 21 Jun 10:30 a.m. PDT — noon PDT


Large multimodal language models have shown remarkable proficiency in understanding and editing images. However, a majority of these visually-tuned models struggle to comprehend the textual content embedded in images, primarily due to limitations in their training data. In this work, we introduce TRINS: a Text-Rich image INStruction dataset, with the objective of enhancing the reading ability of multimodal large language models. Here, "text-rich images" refers to images with rich textual information, such as posters and book covers. TRINS is built using hybrid data annotation strategies that include machine-assisted and human-assisted annotation processes. It contains 39,153 text-rich images, captions, and 102,437 questions. Specifically, we show that the number of words per annotation in TRINS is significantly higher than that of related datasets, providing new challenges. Furthermore, we introduce a simple and effective architecture, called Language-vision Reading Assistant (LaRA), that is good at understanding textual content within images. LaRA outperforms existing state-of-the-art multimodal large language models on the TRINS dataset as well as on other classical benchmarks. Lastly, we conduct a comprehensive evaluation with TRINS on various text-rich image understanding and generation tasks, demonstrating its effectiveness.
