
GRAM: Global Reasoning for Multi-Page VQA

Itshak Blau · Sharon Fogel · Roi Ronen · Alona Golts · Shahar Tsiper · Elad Ben Avraham · Aviad Aberdam · Roy Ganz · Ron Litman

Arch 4A-E Poster #96
Thu 20 Jun 5 p.m. PDT — 6:30 p.m. PDT


The increasing use of transformer-based large language models brings forward the challenge of processing long sequences. In document visual question answering (DocVQA), leading methods focus on the single-page setting, while documents can span hundreds of pages. We present GRAM, a method that seamlessly extends pre-trained single-page models to the multi-page setting, without requiring computationally heavy pretraining. To do so, we leverage a single-page encoder for local page-level understanding, and enhance it with designated document-level layers and learnable tokens, facilitating the flow of information across pages for global reasoning. To ensure that our model utilizes the newly introduced document-level tokens, we propose a tailored bias adaptation method. For additional computational savings during decoding, we introduce an optional compression stage using our C-Former model, which reduces the encoded sequence length, thereby allowing a tradeoff between quality and latency. Extensive experiments showcase GRAM's state-of-the-art performance on multi-page DocVQA benchmarks, demonstrating the effectiveness of our approach.
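The local/global flow described above — each page encoded independently alongside shared learnable document-level tokens, with only those tokens mixed across pages — can be sketched as below. This is a minimal illustration under assumed shapes, not the authors' implementation: `local_encode` and `global_layer` are hypothetical stand-ins (simple mean mixing) for the pre-trained single-page encoder and the document-level layers.

```python
# Hypothetical sketch of GRAM-style local/global encoding.
# All function names and the mean-based "mixing" are illustrative
# placeholders for real attention layers; only the information flow
# (doc tokens bridging pages) mirrors the description in the abstract.
import numpy as np

rng = np.random.default_rng(0)

def local_encode(page_tokens, doc_tokens):
    # Stand-in for the pre-trained single-page encoder: the page is
    # processed together with its copy of the document-level tokens.
    x = np.concatenate([doc_tokens, page_tokens], axis=0)
    return x + x.mean(axis=0, keepdims=True)  # crude proxy for attention

def global_layer(per_page_doc_tokens):
    # Stand-in for a document-level layer: only the doc tokens are
    # mixed across pages, enabling global reasoning at low cost.
    return per_page_doc_tokens + per_page_doc_tokens.mean(
        axis=(0, 1), keepdims=True
    )

n_pages, n_page_tok, n_doc_tok, d = 3, 8, 2, 16
pages = [rng.normal(size=(n_page_tok, d)) for _ in range(n_pages)]
doc_tokens = rng.normal(size=(n_doc_tok, d))  # learnable, shared across pages

# Local pass: each page is encoded independently with the doc tokens.
encoded = [local_encode(p, doc_tokens) for p in pages]
per_page_doc = np.stack([e[:n_doc_tok] for e in encoded])  # (pages, doc, d)

# Global pass: information flows across pages through the doc tokens only.
per_page_doc = global_layer(per_page_doc)
print(per_page_doc.shape)  # (3, 2, 16)
```

Because cross-page mixing touches only the small set of document-level tokens rather than every page token, the global step stays cheap even for documents with many pages.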
