

Poster

CLIP is Almost All You Need: Towards Parameter-Efficient Scene Text Retrieval without OCR

Xugong Qin · Peng Zhang · Jun Jie Ou Yang · Gangyan Zeng · Yubo Li · Yuanyuan Wang · Wanqian Zhang · Pengwen Dai


Abstract:

Scene Text Retrieval (STR) aims to identify all images containing a given query string. Existing methods typically rely on an explicit Optical Character Recognition (OCR) step for text spotting or localization, which entails complex pipelines and accumulated errors. To address this, we turn to Contrastive Language-Image Pre-training (CLIP) models, which have demonstrated the capacity to perceive and understand scene text, making strictly OCR-free STR possible. From the perspective of parameter-efficient transfer learning, we propose a lightweight visual position adapter that supplements the visual encoder with positional information. In addition, we introduce a visual context dropout technique to improve the alignment of local visual features. A novel, parameter-free cross-attention mechanism transfers the contrastive relationship between images and text to that between tokens and text, producing a rich cross-modal representation that can be used for efficient reranking with a linear classifier. The resulting model, CAYN, achieves new state-of-the-art performance on the STR task, with 92.46%/89.49%/85.98% mAP on the SVT/IIIT-STR/TTR datasets at 38.79 FPS on a single GeForce GTX 1080 Ti. Our findings demonstrate that CLIP can serve as a reliable and efficient solution for OCR-free STR, requiring no more than 0.50M additional parameters. The code will be made publicly available.
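The abstract does not give implementation details, but the "parameter-free cross-attention followed by a linear reranking head" idea can be illustrated with a minimal sketch. Assumptions not stated in the paper: PyTorch, L2-normalized CLIP embeddings of dimension 512, a softmax temperature of 0.07, and all function and variable names below, which are purely illustrative.

```python
import torch
import torch.nn.functional as F

def parameter_free_cross_attention(text_emb, visual_tokens, temperature=0.07):
    """Hypothetical sketch of cross-attention with no learned projections.

    text_emb:      (B, D) L2-normalized CLIP text embedding of the query string
    visual_tokens: (B, N, D) L2-normalized CLIP visual token embeddings
    """
    # Similarity between the query text and every visual token -> (B, N)
    sim = torch.einsum('bd,bnd->bn', text_emb, visual_tokens)
    # Softmax over tokens turns similarities into attention weights,
    # reusing the image-text contrastive relationship at the token level
    attn = F.softmax(sim / temperature, dim=-1)
    # Attention-weighted sum of visual tokens -> cross-modal representation (B, D)
    return torch.einsum('bn,bnd->bd', attn, visual_tokens)

# A single linear layer scores the fused representation for reranking.
# The embedding width (512) is an assumption, not a detail from the paper.
rerank_head = torch.nn.Linear(512, 1)

def rerank_scores(text_emb, visual_tokens):
    fused = parameter_free_cross_attention(text_emb, visual_tokens)
    # Higher score = image is more likely to contain the query string
    return rerank_head(fused).squeeze(-1)
```

Because the attention itself has no trainable weights, the only added parameters in this stage come from the linear head, consistent with the paper's emphasis on a sub-0.50M parameter budget.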
