Poster
DocSAM: Unified Document Image Segmentation via Query Decomposition and Heterogeneous Mixed Learning
Xiao-Hui Li · Fei Yin · Cheng-Lin Liu
Document image segmentation is crucial in document analysis and recognition but remains challenging due to the heterogeneity of document formats and the diversity of segmentation tasks. Existing methods often treat these tasks separately, leading to limited generalization and wasted resources. This paper introduces DocSAM, a transformer-based unified framework for various document image segmentation tasks, including document layout analysis, multi-granularity text segmentation, and table structure recognition, by modelling these tasks as a combination of instance and semantic segmentation. Specifically, DocSAM uses Sentence-BERT to map category names from each dataset into semantic queries of the same dimension as the instance queries. These queries interact through attention mechanisms and are cross-attended with image features to predict instance and semantic segmentation masks. To predict instance categories, the dot product of each instance query with each semantic query is computed, and the resulting scores are normalized using softmax. As a result, DocSAM can be jointly trained on heterogeneous datasets, enhancing robustness and generalization while reducing computing and storage resources. Comprehensive evaluations show that DocSAM outperforms existing methods in accuracy, efficiency, and adaptability, highlighting its potential for advancing document image understanding and segmentation in various applications.
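The category-prediction step described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the function names, the toy 3-dimensional vectors, and the two category names are hypothetical, and the abstract does not say whether the scores are scaled or whether a background class is included, so neither is modelled here. Each instance query is scored against every semantic query by a dot product, and the scores are normalized with softmax.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def classify_instances(instance_queries, semantic_queries):
    """Return one probability distribution over categories per instance query.

    Assumption: both query sets share the same embedding dimension, as the
    abstract states. Scaling and a "no object" class are not modelled.
    """
    probabilities = []
    for query in instance_queries:
        # Dot product of this instance query with every semantic query.
        logits = [
            sum(a * b for a, b in zip(query, semantic))
            for semantic in semantic_queries
        ]
        probabilities.append(softmax(logits))
    return probabilities


# Hypothetical toy embeddings standing in for Sentence-BERT outputs of the
# category names "text" and "table" (real embeddings are far larger).
semantic_queries = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
# Hypothetical instance queries after interaction with image features.
instance_queries = [[2.0, 0.1, 0.0], [0.0, 3.0, 0.5]]

probs = classify_instances(instance_queries, semantic_queries)
# Index of the highest-probability category for each instance.
labels = [max(range(len(p)), key=p.__getitem__) for p in probs]
```

Because the category set enters only through the semantic queries, a dataset with a different label vocabulary changes only the rows of `semantic_queries`, which is consistent with the abstract's claim that heterogeneous datasets can be trained jointly.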