

Poster

Enhancing Vision-Language Pretraining with Rich Supervisions

Yuan Gao · Kunyu Shi · Pengkai Zhu · Edouard Belval · Oren Nuriel · Srikar Appalaraju · Shabnam Ghadar · Zhuowen Tu · Vijay Mahadevan · Stefano Soatto


Abstract:

We propose Strongly Supervised pre-training with ScreenShots (S4) - a novel pre-training paradigm for Vision-Language Models using data from large-scale web screenshot rendering. Web screenshots unlock a treasure trove of visual and textual cues that are simply not present in image-text pairs. In S4, we leverage the inherent tree-structure hierarchy of HTML elements and their spatial localization to carefully design 10 pre-training tasks with large-scale annotated data. These tasks resemble downstream tasks across different domains, and the annotations are cheap to obtain. We demonstrate that, compared to current screenshot pre-training objectives, our pre-training method significantly enhances the performance of image-to-text models on nine varied and popular downstream tasks - improvements of up to 76.1% on Table Detection, and of at least 1% on Widget Captioning.
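As an illustrative sketch (not the authors' actual pipeline), the tree-structure side of such supervision can be obtained by walking an HTML document and recording each text node's enclosing tag and nesting depth; the real rendering-based setup described above would additionally attach spatial bounding boxes, which a parser alone cannot provide. The `TreeAnnotator` class below is a hypothetical helper using only the Python standard library:

```python
from html.parser import HTMLParser

class TreeAnnotator(HTMLParser):
    """Record (tag, depth, text) annotations from an HTML document.

    Illustrative only: a screenshot-rendering pipeline would also
    record each element's on-screen bounding box.
    """
    def __init__(self):
        super().__init__()
        self.stack = []          # currently open tags; length = nesting depth
        self.annotations = []    # (enclosing tag, depth, text) tuples

    def handle_starttag(self, tag, attrs):
        self.stack.append(tag)

    def handle_endtag(self, tag):
        if self.stack:
            self.stack.pop()

    def handle_data(self, data):
        text = data.strip()
        if text and self.stack:
            self.annotations.append((self.stack[-1], len(self.stack), text))

parser = TreeAnnotator()
parser.feed("<div><h1>Title</h1><p>Some <b>bold</b> text</p></div>")
# parser.annotations now pairs each text span with its tag and depth,
# e.g. ("b", 3, "bold") for the <b> element nested under <p> under <div>.
```

Annotations of this form (hierarchy plus, in the full setup, coordinates) are what make it cheap to mint task labels such as element captioning or localization at web scale.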
