

Poster

Unleashing the Potential of Consistency Learning for Detecting and Grounding Multi-Modal Media Manipulation

Yiheng Li · Yang Yang · Zichang Tan · Huan Liu · Weihua Chen · Xu Zhou · Zhen Lei


Abstract: To tackle the threat of fake news, the task of detecting and grounding multi-modal media manipulation (DGM4) has received increasing attention. However, most state-of-the-art methods fail to explore the fine-grained consistency within local contents, usually resulting in an inadequate perception of detailed forgery and unreliable results. In this paper, we propose a novel approach named Contextual-Semantic Consistency Learning (CSCL) to enhance the fine-grained forgery perception ability for DGM4. Two branches for the image and text modalities are established, each of which contains two cascaded decoders, i.e., a Contextual Consistency Decoder (CCD) and a Semantic Consistency Decoder (SCD), to capture within-modality contextual consistency and across-modality semantic consistency, respectively. Both CCD and SCD adhere to the same criteria for capturing fine-grained forgery details. To be specific, each module first constructs consistency features by leveraging additional supervision from the heterogeneous information of each token pair. Then, forgery-aware reasoning or aggregation is adopted to deeply mine forgery cues based on the consistency features. Extensive experiments on DGM4 datasets prove that CSCL achieves new state-of-the-art performance, especially in grounding manipulated content.
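The two-stage flow described above (build pairwise consistency features from token pairs, then aggregate them into forgery cues) can be illustrated with a minimal sketch. This is an illustrative toy, not the authors' implementation: the function names, the use of cosine similarity as the consistency measure, and the max-based aggregation are all assumptions made for clarity.

```python
# Toy sketch of a consistency-then-aggregation pipeline (illustrative only;
# CSCL's actual decoders are learned modules, not these hand-written rules).
from math import sqrt

def cosine(u, v):
    """Cosine similarity between two token embeddings."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = sqrt(sum(a * a for a in u))
    nv = sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0

def consistency_features(tokens_a, tokens_b):
    """Pairwise consistency map: one score per token pair (a_i, b_j).
    Within one modality this plays the role of contextual consistency;
    across modalities, semantic consistency."""
    return [[cosine(a, b) for b in tokens_b] for a in tokens_a]

def forgery_scores(consistency_map):
    """Aggregate each token's row of consistency scores into a forgery cue:
    a token that agrees poorly with all its counterparts scores higher."""
    return [1.0 - max(row) for row in consistency_map]

# Example: three image tokens vs. two text tokens; the last image token
# is inconsistent with every text token, so it gets the highest cue.
img_tokens = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0]]
txt_tokens = [[1.0, 0.0], [0.8, 0.2]]
scores = forgery_scores(consistency_features(img_tokens, txt_tokens))
```

In CSCL itself, the consistency features additionally receive supervision from the heterogeneous information of each token pair, and the aggregation is replaced by learned forgery-aware reasoning within the cascaded decoders.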
