Multigrain-aware Semantic Prototype Scanning and Tri-Token Prompt Learning Embraced High-Order RWKV for Pan-Sharpening
Abstract
In this work, we propose a Multigrain-aware Semantic Prototype Scanning paradigm for pan-sharpening, built upon a KV-sharing RWKV architecture for efficient global modeling and coupled with a novel tri-token prompting mechanism derived from semantic clustering to steer the fusion process. Our design adheres to the following principles: 1) Multigrain-aware Semantic Prototype Scanning. While RWKV offers an efficient linear alternative to attention, its recurrent scanning mechanism often introduces positional bias and lacks semantic guidance. To address this, we introduce a semantic-driven scanning strategy: local hashing is first employed to generate semantic prototypes via clustering, segmenting the image into coherent regions; the scanning order is then made explicitly aware of multi-grain semantic structures, allowing the model to focus on contextually relevant regions during fusion and thereby enhancing spectral integrity and spatial coherence beyond sequence-agnostic approaches. 2) Tri-token Prompt Learning. The core of our framework is a tri-token prompting mechanism comprising (i) a globally sourced token that encapsulates the holistic image context, (ii) cluster-derived prototype tokens that represent distinct semantic regions, and (iii) a learnable register token that acts as a dynamic buffer to explicitly identify and absorb the noisy feature artifacts that commonly arise in standard global modeling. The global and prototype tokens are broadcast as semantic prompts to guide RWKV's processing, while the register continuously refines the intermediate features. 3) Invertible Q-Shift. To counteract the loss of spatial detail, we tailor two key designs: a central difference convolution is applied on the value pathway within the RWKV block, actively injecting high-frequency information to preserve fine textures; and parameter-heavy receptive-field expansion is replaced by an invertible-neural-network-empowered multi-scale Q-shift operation.
This module performs efficient, lossless feature transformation and shifting across split channels, significantly enriching the feature representation. Experimental results demonstrate the superiority of our method.
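To make the "lossless shifting across split channels" concrete, the following is a minimal NumPy sketch of an invertible multi-scale Q-shift: channels are split into groups, and each group is cyclically shifted in one of four spatial directions at one of several scales. The grouping scheme, cyclic-roll shifts, and function names here are illustrative assumptions for exposition, not the paper's exact INN-based formulation.

```python
import numpy as np

def multiscale_qshift(x: np.ndarray, scales=(1, 2)) -> np.ndarray:
    """Sketch of an invertible multi-scale Q-shift (assumed formulation).

    Channels are split into 4 * len(scales) groups; each group is
    cyclically rolled in one of four directions at one scale. Cyclic
    rolls are bijective, so the transform is exactly lossless.
    x: (C, H, W) feature map.
    """
    C = x.shape[0]
    out = x.copy()
    groups = np.array_split(np.arange(C), 4 * len(scales))
    dirs = [(0, 1), (0, -1), (1, 0), (-1, 0)]  # right, left, down, up
    for g, idx in enumerate(groups):
        dy, dx = dirs[g % 4]
        s = scales[g // 4]
        out[idx] = np.roll(x[idx], shift=(dy * s, dx * s), axis=(1, 2))
    return out

def multiscale_qshift_inverse(y: np.ndarray, scales=(1, 2)) -> np.ndarray:
    """Exact inverse: roll each channel group back by the opposite offset."""
    C = y.shape[0]
    out = y.copy()
    groups = np.array_split(np.arange(C), 4 * len(scales))
    dirs = [(0, 1), (0, -1), (1, 0), (-1, 0)]
    for g, idx in enumerate(groups):
        dy, dx = dirs[g % 4]
        s = scales[g // 4]
        out[idx] = np.roll(y[idx], shift=(-dy * s, -dx * s), axis=(1, 2))
    return out
```

Because each cyclic roll is a permutation of pixel positions, the composite map is a zero-parameter bijection: no boundary information is discarded (unlike zero-padded shifts), which is what makes the operation lossless.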