Skip to yearly menu bar Skip to main content


OTE: Exploring Accurate Scene Text Recognition Using One Token

Jianjun Xu · Yuxin Wang · Hongtao Xie · Yongdong Zhang

Arch 4A-E Poster #401
[ ]
Fri 21 Jun 5 p.m. PDT — 6:30 p.m. PDT


In this paper, we propose a novel framework to fully exploit the potential of a single vector for scene text recognition (STR). Different from previous sequence-to-sequence methods that rely on a sequence of visual tokens to represent scene text images, we prove that just one token is enough to characterize the entire text image and achieve accurate text recognition. Based on this insight, we introduce a new paradigm for STR, called One Token Ecognizer (OTE). Specifically, we implement an image-to-vector encoder to extract the fine-grained global semantics, eliminating the need for sequential features. Furthermore, an elegant yet potent vector-to-sequence decoder is designed to adaptively diffuse global semantics to corresponding character locations, enabling both autoregressive and non-autoregressive decoding schemes. By executing decoding within a high-level representational space, our vector-to-sequence (V2S) approach avoids the alignment issues between visual tokens and character embeddings prevalent in traditional sequence-to-sequence methods. Remarkably, due to introducing character-wise fine-grained information, such global tokens also boost the performance of scene text retrieval tasks. Extensive experiments on synthetic and real datasets demonstrate the effectiveness of our method by achieving new state-of-the-art results on various public STR benchmarks. Code will be available.

Live content is unavailable. Log in and register to view live content