Towards Unified Human Perception and Machine Understanding: Token Flow Guided Compression Framework
Abstract
With the rapid rise of Large Vision Language Models (LVLMs) for image understanding, the objective of image compression is gradually shifting from human visual perception to machine-oriented semantic understanding. However, conventional learned compression techniques are optimized for pixel-level fidelity and typically operate at fixed or rigid bitrate points, which is misaligned with the needs of semantic consistency and flexible bitrate control. This gap becomes critical in ultra-low-bitrate regimes, where latent representations often ignore semantic relevance and struggle to disentangle meaningful content from redundant visual details as the bitrate varies. To address these challenges, we develop a token-based flexible compression framework, Token Flow Guided Compression (TFGC), which unifies human- and machine-oriented objectives. TFGC supports variable bitrate control in ultra-low-bitrate regimes and enables LVLMs to directly process compressed tokens without image reconstruction. Specifically, we explore the token flow phenomenon in 1D token sequences and exploit it to design token flow propagation, which predicts missing tokens by propagating contextual information from unmasked tokens. Moreover, token semantic guidance aligns compressed representations with the LVLM semantic space, while a progressive semantic alignment training strategy further bridges the gap between perceptual reconstruction and semantic reasoning. Experiments show that our framework achieves state-of-the-art LVLM understanding at comparable bitrates while maintaining satisfactory perceptual quality.
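To make the idea of "predicting missing tokens by propagating contextual information from unmasked tokens" concrete, the following is a minimal, hypothetical sketch, not the paper's actual architecture: masked positions in a 1D token sequence are filled by a distance-weighted softmax aggregation over the unmasked tokens. The function name, the position-based affinity, and the temperature parameter `tau` are illustrative assumptions; the actual TFGC module would use learned attention.

```python
import numpy as np

def token_flow_propagation(tokens, mask, tau=1.0):
    """Illustrative sketch (not the TFGC implementation): fill masked
    token embeddings by softmax attention over unmasked tokens.

    tokens: (N, d) array of token embeddings.
    mask:   (N,) boolean array, True = token kept (unmasked).
    tau:    temperature; smaller values favor nearer unmasked tokens.
    """
    out = tokens.astype(np.float64).copy()
    kept = np.where(mask)[0]
    missing = np.where(~mask)[0]
    if len(kept) == 0 or len(missing) == 0:
        return out
    # Position-based affinity: nearer unmasked tokens contribute more.
    dist = np.abs(missing[:, None] - kept[None, :])      # (M, K)
    logits = -dist / tau
    w = np.exp(logits - logits.max(axis=1, keepdims=True))
    w /= w.sum(axis=1, keepdims=True)                    # softmax weights
    # Propagate context: masked tokens become convex combinations
    # of the unmasked tokens' embeddings.
    out[missing] = w @ out[kept]
    return out
```

In this toy version, each predicted token is a convex combination of unmasked embeddings, so masked content is reconstructed purely from surrounding context, which is the behavior the token flow phenomenon is exploited for.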