SAM2Text: Towards Prompt-Free and Multi-Resolution Video Scene Text Segmentation
Abstract
We introduce a novel method for video Scene Text Segmentation (STS), a task critical for understanding dynamic visual content. Despite the success of foundation models such as SAM2 in generic segmentation, their application to video STS is hindered by their reliance on external prompts, limited output resolution, and instability across video sequences. To address these issues, we present a comprehensive framework built on SAM2. First, we fine-tune the image encoder with LoRA and integrate a self-prompting module, enabling the model to autonomously generate text-specific prompts. Second, we augment the decoder with additional upsampling branches at 512×512 and 1024×1024 resolutions, complementing the original 256×256 output to produce high-fidelity, multi-resolution masks. Third, we enhance the memory mechanism by combining short-term memory with a top-k selection strategy, ensuring temporally consistent and stable segmentation across video frames. A significant obstacle in video STS is data scarcity. To address this, we contribute two datasets: STS-SynthV, containing 1,410 synthetic video clips generated via FlowText, and STS-RealV, comprising 660 meticulously annotated real-world video sequences. Extensive experiments demonstrate that our method achieves state-of-the-art performance on multiple video and image scene text benchmarks.
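To make the memory strategy above concrete, the following minimal PyTorch sketch illustrates one way to combine a short-term window of recent frames with top-k selection over older memory frames. All names, shapes, and the cosine-similarity scoring are illustrative assumptions for exposition, not the exact mechanism of SAM2 or of our framework.

```python
import torch

def select_memory_frames(curr_feat, mem_feats, short_window=3, k=4):
    """Pick memory frames to condition the current frame on.

    curr_feat:  (C,) pooled feature of the current frame.
    mem_feats:  (T, C) pooled features of T stored memory frames,
                ordered oldest -> newest.
    Returns indices of selected frames: the most recent `short_window`
    frames (short-term memory) plus the top-k older frames most similar
    to the current frame (top-k selection). Hypothetical sketch.
    """
    T = mem_feats.shape[0]
    # Short-term memory: always keep the last `short_window` frames.
    short_idx = torch.arange(max(0, T - short_window), T)

    # Older frames are candidates for top-k retrieval.
    older = mem_feats[: max(0, T - short_window)]
    if older.numel() == 0:
        return short_idx

    # Score older frames by cosine similarity to the current frame.
    sims = torch.nn.functional.cosine_similarity(
        older, curr_feat.unsqueeze(0), dim=-1
    )
    topk_idx = torch.topk(sims, k=min(k, older.shape[0])).indices
    return torch.unique(torch.cat([topk_idx, short_idx]))

# Usage: 10 memory frames with 256-d pooled features.
torch.manual_seed(0)
mem = torch.randn(10, 256)
cur = torch.randn(256)
print(select_memory_frames(cur, mem))  # e.g. tensor([1, 3, 7, 8, 9, ...])
```

Restricting attention to a fixed-size short-term window plus a small top-k set keeps the memory cost bounded per frame while still letting the model recall distant frames that resemble the current one, which is what stabilizes segmentation over long sequences.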