Focus-to-Perceive Representation Learning: A Cognition-Inspired Hierarchical Framework for Endoscopic Video Analysis
Abstract
Endoscopic video analysis is crucial for early gastrointestinal screening, but progress is constrained by the scarcity of high-quality annotations. While self-supervised video pre-training is promising, existing methods designed for natural videos prioritize dense spatio-temporal modeling and exhibit a motion bias, neglecting the static, structured semantics that are critical for clinical decision-making. To address this challenge, we propose Focus-to-Perceive Representation Learning (FPRL), a cognition-inspired hierarchical framework that emulates how clinicians examine endoscopic videos: it first focuses on intra-frame, lesion-centric regions to learn static semantics, and then perceives their evolution across frames to model contextual semantics. FPRL realizes this with a hierarchical semantic modeling mechanism that explicitly distinguishes and jointly learns the two types of semantics. Static semantics are captured by teacher-prior adaptive masking (TPAM) combined with multi-view sparse sampling, which suppresses redundant temporal dependencies and concentrates the model on lesion-related local semantics. Contextual semantics are then derived through cross-view masked feature completion (CVMFC) and attention-guided temporal prediction (AGTP), which establish cross-view correspondences and model structured inter-frame evolution, reinforcing temporal semantic continuity while preserving global contextual integrity. Extensive experiments on 11 endoscopic video datasets show that FPRL achieves state-of-the-art performance across diverse downstream tasks, demonstrating its effectiveness and strong generalization for endoscopic video representation learning.
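To make the two-stage "focus, then perceive" idea concrete, the following is a minimal sketch, not the authors' implementation: module names (teacher_prior_mask, TinyEncoder) and all hyperparameters are hypothetical stand-ins. Stage 1 mimics teacher-prior adaptive masking by keeping only patches a frozen teacher deems salient; Stage 2 mimics cross-view masked feature completion by predicting the masked features of one sparsely sampled view from another.

```python
# Hedged sketch of the focus-to-perceive pipeline; details differ from the paper's FPRL.
import torch
import torch.nn as nn
import torch.nn.functional as F


def teacher_prior_mask(teacher_attn: torch.Tensor, keep_ratio: float) -> torch.Tensor:
    """Keep the patches the teacher attends to most; mask the rest.

    teacher_attn: (B, N) per-patch scores, higher = more lesion-relevant.
    Returns a boolean mask of shape (B, N) where True marks *visible* patches.
    """
    B, N = teacher_attn.shape
    n_keep = max(1, int(N * keep_ratio))
    idx = teacher_attn.topk(n_keep, dim=1).indices        # most-attended patches
    mask = torch.zeros(B, N, dtype=torch.bool)
    mask.scatter_(1, idx, True)
    return mask


class TinyEncoder(nn.Module):
    """Stand-in ViT-style encoder over patch tokens (hypothetical)."""
    def __init__(self, dim: int = 64):
        super().__init__()
        self.proj = nn.Linear(dim, dim)
        self.block = nn.TransformerEncoderLayer(dim, nhead=4, batch_first=True)

    def forward(self, tokens: torch.Tensor) -> torch.Tensor:
        return self.block(self.proj(tokens))


if __name__ == "__main__":
    B, N, D = 2, 49, 64                                    # batch, patches per frame, token dim
    student, teacher = TinyEncoder(D), TinyEncoder(D)
    view_a, view_b = torch.randn(B, N, D), torch.randn(B, N, D)  # two sparsely sampled views

    # Stage 1 (focus): teacher saliency decides which lesion-centric patches stay visible.
    with torch.no_grad():
        attn = teacher(view_a).norm(dim=-1)                # crude saliency proxy
    visible = teacher_prior_mask(attn, keep_ratio=0.25)

    # Encode only the visible patches of view A (static, lesion-centric semantics).
    feat_a = student(view_a * visible.unsqueeze(-1).float())

    # Stage 2 (perceive): complete masked features of view A from view B,
    # a simplified analogue of cross-view completion / temporal prediction.
    with torch.no_grad():
        target = teacher(view_a)
    pred = student(view_b)
    loss = F.mse_loss(pred[~visible], target[~visible])
    print("toy completion loss:", loss.item())
```

In this toy version the teacher is an untrained twin and saliency is a feature-norm proxy; in practice the teacher would be an EMA copy of the student and the masking, sampling, and prediction heads would follow the paper's TPAM, CVMFC, and AGTP designs.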