Cross-modal Representation Learning for Diffusion-generated Image Detection
Abstract
The astonishing proficiency and unprecedented realism of diffusion models in creating and manipulating images have understandably raised concerns. Many methods have been proposed to detect generated images. They typically take RGB images as input and use backbones such as ResNet or the CLIP visual encoder to extract features. Although these backbones can detect fake images, they are primarily designed to extract high-level semantic information rather than being tailored to fake image detection. To this end, in this paper we aim to optimize an embedding space tailored for detecting fake images via representation learning. We observe that Neighboring Pixel Relationships (NPR) can capture intrinsic forgery clues, suggesting that NPR is a promising input for representation learning aimed at such a forgery-aware embedding space. Therefore, we leverage features from both the RGB and NPR modalities in two proposed representation learning methods, Cross-Modal Contrastive Learning (CMCL) and Cross-Modal Mutual Distillation (CMMD), to learn a forgery-aware embedding space. CMCL boosts the discrimination between features of real and fake images, while CMMD simultaneously transfers the learned knowledge between the two modalities, yielding compact intra-class features. CMCL and CMMD work collaboratively so that each modality learns a more comprehensive forgery-aware representation for distinguishing real from fake images. Extensive experiments on the GenImage, DRCT-2M, and Co-Spy-Bench datasets show that our method achieves state-of-the-art results.
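As an illustration of the NPR modality referenced above, the sketch below follows one common formulation of Neighboring Pixel Relationships: the residual between an image and a copy that has been down-sampled and then up-sampled, which exposes the local up-sampling artifacts that generative models tend to leave behind. The function name `npr` and the 2x2 neighborhood (`scale=2`) are illustrative assumptions, not necessarily the exact configuration used in this paper.

```python
import torch
import torch.nn.functional as F


def npr(images: torch.Tensor, scale: int = 2) -> torch.Tensor:
    """Sketch of Neighboring Pixel Relationships (NPR).

    images: batch of RGB images, shape (N, C, H, W), with H and W
            divisible by `scale`.
    Returns a tensor of the same shape containing the residual between
    each pixel and its nearest-neighbor-resampled value, i.e. the
    differences within each scale x scale pixel neighborhood.
    """
    # Down-sample then up-sample with nearest-neighbor interpolation;
    # every pixel in a neighborhood is replaced by one reference pixel.
    down = F.interpolate(images, scale_factor=1.0 / scale, mode="nearest")
    up = F.interpolate(down, scale_factor=float(scale), mode="nearest")
    # The residual keeps only local pixel-to-pixel relationships,
    # discarding most semantic content.
    return images - up
```

The resulting NPR map has the same spatial size as the input, so it can be fed to a second encoder branch in parallel with the RGB branch; on smooth (e.g. constant) regions the residual is zero, while up-sampling artifacts produce structured non-zero patterns.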