Human-Centric Multi-Exposure Fusion: Benchmark and Bi-level Cognition Distillation Framework
Abstract
Multi-Exposure Fusion (MEF) seeks to generate a single high-quality image from multiple inputs captured at different exposure levels. Despite substantial progress, most existing approaches depend on statistical metrics that poorly reflect human perceptual preferences. Electroencephalography (EEG) provides a direct physiological window into human cognition, yet its use in low-level vision remains limited by scarce paired data and the absence of bio-signals at inference time. We address these challenges through two key contributions. First, we introduce Cog-Expo, the first dataset capturing human cognitive responses to multi-exposure stimuli, establishing a bridge between neuroscience and computational photography. Second, we propose a bi-level coupled learning framework that leverages this cognitive information without requiring it during inference. A Mental Integrated Transformer serves as the Teacher, incorporating cognitive priors to guide visual feature learning, while a lightweight Student is trained to approximate these cognitive cues from image inputs alone. Through bi-level optimization, the Teacher learns inherently distillable representations, enabling the Student to emulate cognitive guidance efficiently. Extensive experiments confirm that our method achieves state-of-the-art fusion performance and aligns more closely with human perception.