Poster
CLIP is Almost All You Need: Towards Parameter-Efficient Scene Text Retrieval without OCR
Xugong Qin · Peng Zhang · Jun Jie Ou Yang · Gangyan Zeng · Yubo Li · Yuanyuan Wang · Wanqian Zhang · Pengwen Dai
Scene Text Retrieval (STR) aims to identify all images containing a given query string. Existing methods typically rely on an explicit Optical Character Recognition (OCR) step of text spotting or localization, which leads to complex pipelines and accumulated errors. To address this, we turn to Contrastive Language-Image Pre-training (CLIP) models, which have demonstrated the capacity to perceive and understand scene text, making strictly OCR-free STR possible. From the perspective of parameter-efficient transfer learning, we propose a lightweight visual position adapter that supplies complementary positional information to the visual encoder. In addition, we introduce a visual context dropout technique to improve the alignment of local visual features. A novel, parameter-free cross-attention mechanism transfers the contrastive relationship between images and text to one between tokens and text, producing a rich cross-modal representation that can be used for efficient reranking with a linear classifier. The resulting model, CAYN, achieves new state-of-the-art performance on the STR task, with 92.46%/89.49%/85.98% mAP on the SVT/IIIT-STR/TTR datasets at 38.79 FPS on a single GeForce GTX 1080 Ti. Our findings demonstrate that CLIP can serve as a reliable and efficient solution for OCR-free STR, requiring no more than 0.50M additional parameters. The code will be made publicly available.
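To make the parameter-free cross-attention idea concrete, here is a minimal PyTorch sketch of one plausible reading of the abstract: the CLIP image-text similarity is reused as attention logits between the query text embedding and the visual token embeddings, with no learned projection matrices, and the pooled cross-modal feature is scored by a single linear classifier for reranking. All names, the temperature value, and the 512-dimensional embedding size (CLIP ViT-B) are illustrative assumptions, not the authors' released implementation.

```python
import torch
import torch.nn.functional as F

def parameter_free_cross_attention(text_emb, visual_tokens, temperature=0.07):
    """Hypothetical sketch: reuse CLIP's contrastive similarity as
    attention weights between a text query and visual tokens.

    text_emb:      (B, D)    L2-normalized CLIP text embeddings
    visual_tokens: (B, N, D) L2-normalized visual token embeddings
    """
    # Token-text cosine similarities play the role of attention logits;
    # no learned query/key/value projections are involved.
    logits = torch.einsum('bd,bnd->bn', text_emb, visual_tokens) / temperature
    weights = logits.softmax(dim=-1)  # (B, N)
    # Similarity-weighted pooling of visual tokens yields a
    # text-conditioned cross-modal representation.
    return torch.einsum('bn,bnd->bd', weights, visual_tokens)

# Reranking head: a single linear classifier over the cross-modal
# feature, as the abstract suggests (output dim and setup assumed).
reranker = torch.nn.Linear(512, 1)

text_emb = F.normalize(torch.randn(4, 512), dim=-1)
visual_tokens = F.normalize(torch.randn(4, 196, 512), dim=-1)
scores = reranker(parameter_free_cross_attention(text_emb, visual_tokens))
```

Because the attention itself introduces no parameters, the only trainable components in this sketch are the linear reranker and the lightweight adapter, which is consistent with the sub-0.50M parameter budget reported in the abstract.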