Bridging Pixels and Words: Mask-Aware Local Semantic Fusion for Multimodal Media Verification
Abstract
As the harm caused by fake news grows, the task of detecting and grounding multi-modal media manipulation (DGM4) is attracting increasing attention. Existing multimodal methods overlook fine-grained semantic alignment between the visual and textual modalities, limiting their ability to detect sophisticated and subtle cross-modal manipulations. To address this challenge, we present MaLSF, a novel Mask-aware Local Semantic Fusion framework that explicitly bridges words and pixels via mask-label pairs, enabling the model to reason precisely over fine-grained cross-modal correspondences. MaLSF captures cross-modal local semantics through two key innovations: 1) a Bidirectional Cross-modal Verification (BCV) module that identifies semantic conflicts between masked regions and their associated labels via a bidirectional query mechanism; and 2) a Hierarchical Semantic Aggregation (HSA) module that adaptively aggregates multi-granularity local semantics into decoupled features for task-specific verification. In addition, to obtain fine-grained mask-label pairs, we introduce a suite of parsers that extract diverse mask-label pairs. The proposed model is evaluated on multiple datasets and achieves state-of-the-art performance on both the DGM4 and multimodal fake news detection tasks. Extensive ablation studies and visualization results further verify its effectiveness and interpretability.
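To make the bidirectional query mechanism concrete, the following is a minimal, hypothetical PyTorch sketch of how a BCV-style module could cross-query masked-region features with label embeddings in both directions and score cross-modal conflict. All class names, dimensions, pooling, and the conflict head are illustrative assumptions, not the paper's actual implementation.

```python
# Hypothetical sketch of a bidirectional cross-modal verification step.
# All names, dimensions, and the fusion/scoring choices are assumptions
# for illustration; they are not the paper's actual BCV design.
import torch
import torch.nn as nn

class BidirectionalVerificationSketch(nn.Module):
    def __init__(self, dim: int = 256, num_heads: int = 4):
        super().__init__()
        # Region features query label embeddings (vision -> text) ...
        self.region_to_label = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        # ... and label embeddings query region features (text -> vision).
        self.label_to_region = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        # Simple head scoring (dis)agreement between the two views (assumed).
        self.conflict_head = nn.Linear(2 * dim, 1)

    def forward(self, region_feats: torch.Tensor, label_embeds: torch.Tensor):
        # region_feats: (B, R, dim) features of masked image regions.
        # label_embeds: (B, L, dim) embeddings of the associated labels.
        r2l, _ = self.region_to_label(region_feats, label_embeds, label_embeds)
        l2r, _ = self.label_to_region(label_embeds, region_feats, region_feats)
        # Pool each direction and produce a conflict logit per sample.
        fused = torch.cat([r2l.mean(dim=1), l2r.mean(dim=1)], dim=-1)  # (B, 2*dim)
        return self.conflict_head(fused)  # (B, 1)

# Usage with random tensors standing in for real features.
model = BidirectionalVerificationSketch()
regions = torch.randn(2, 5, 256)  # 2 samples, 5 masked regions each
labels = torch.randn(2, 5, 256)   # matching label embeddings
print(model(regions, labels).shape)  # torch.Size([2, 1])
```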