PIX-TAB: Efficient PIXel-Precise TABle Structure Recognition Approach with Speculative Decoding and Region-Based Image Segmentation
Viktor Zaytsev ⋅ Olena Vynokurova ⋅ Pavlo Tytarchuk ⋅ Dmytro Kozii ⋅ Vitalii Pohribnyi ⋅ Olga Radyvonenko ⋅ Artem Shcherbina
Abstract
Table structure recognition in document AI faces significant challenges due to layout inconsistencies, merged cells, and complex nested structures, which are further exacerbated by the scarcity of large, diverse annotated datasets. In this paper, we present PIX-TAB (Efficient PIXel-Precise TABle Structure Recognition Approach), which provides exact, pixel-level structure using a small, lightweight model that can run on-device. The approach is language-agnostic: support for a new language can be added simply by replacing the Optical Character Recognition (OCR) model, without modifying the core structure recognition model. Key innovations include position-aware, pixel-precise tokens for deterministic cell reconstruction; speculative decoding for faster sequence generation; training-only box supervision to stabilize spatial grounding; and region-based image segmentation. To mitigate data scarcity, we propose a pipeline for generating a large synthetic table dataset. Experimental results validate each component. To address the limitations of existing evaluation methods, we introduce the $TEDS_{struct100}$ and $TEDS_{100}$ metrics. The speculative decoding approach significantly improves recognition speed while maintaining accuracy. Finally, the combined techniques enable a mobile-optimized model that is more than three times faster than the full-size version.