More Than Meets the Eye: A Unified Image Fusion Framework via Semantic-Pixel Entropy Trade-off for Zero-Shot Generalization
Abstract
Existing image fusion methods struggle to adapt to unseen fusion tasks and to balance semantic information with pixel-level detail. These difficulties stem from three key challenges: (1) the lack of a unified, task-agnostic optimization objective; (2) the inherent difficulty of balancing semantic fidelity and pixel-level richness; and (3) an over-reliance on supervised learning, which limits transferability across tasks. To overcome these issues, this work proposes a unified fusion framework that generalizes to diverse fusion tasks even when trained solely on infrared–visible image pairs. Specifically, inspired by the free-energy principle, we introduce a fusion paradigm that combines a high pixel-entropy expectation with a low semantic-entropy expectation, and we design a frequency-aware feature decoupling mechanism to balance semantic content and pixel detail. Furthermore, an unsupervised dual-path trade-off strategy imposes collaborative constraints at both the semantic and pixel levels. Experiments show that our method significantly outperforms existing state-of-the-art methods in visual quality and downstream-task performance. It not only handles trained tasks efficiently but also generalizes well to unseen fusion tasks, while remaining lightweight and practical to deploy. Code and data will be made publicly available.