Generalizable Co-Salient Object Detection via Mixed Content-Style Modulation
Abstract
This paper presents a generalizable co-salient object detection (CoSOD) framework via mixed content-style modulation, termed CoMCS, to enhance the model's robustness to unseen domains. CoMCS consists of a mixed content modulator (MCM), a mixed style modulator (MSM), and a collaborative semantic contrast module (SCM), which together extract scene structure priors and augment source-domain styles to bridge the gap between the source domain and unseen domains. Specifically, CoMCS first uses the CLIP model to extract conceptual knowledge associated with the semantic classes in the whole scene, yielding domain-invariant multi-class semantic embeddings. Subsequently, the MCM models the semantic relationships between the prototypes of co-salient objects and the multi-class semantic embeddings through cross-attention, capturing domain-invariant scene structure priors that reduce scene distribution shift in unseen domains. Meanwhile, to alleviate domain perturbations encountered at test time, the MSM models the uncertainty of domain shift by synthesizing feature statistics, such as the mean and standard deviation, during training to simulate novel styles, thereby achieving data augmentation within the source domain. Finally, to reduce the ambiguity of co-salient object representations on test data from unseen domains, the SCM employs a uniformity loss that encourages the learned prototypes to be uniformly distributed on the hypersphere, further enhancing the framework's domain generalization. In addition, to further verify the generalization ability of CoMCS to unseen domains, we construct an unseen-domain benchmark dataset (UND) by selecting a variety of image groups with unseen classes from CoCA, CoSOD3k, and CoSal2015. Extensive evaluations on the four benchmark datasets demonstrate that CoMCS performs favorably against a variety of state-of-the-art methods.
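To make the pipeline concrete, the sketches below illustrate each component in PyTorch. They are hypothetical re-implementations under stated assumptions, not the paper's released code. First, the multi-class semantic embeddings can be obtained from CLIP's text encoder; the class names, prompt template, and backbone choice are illustrative placeholders.

```python
import torch
import clip  # OpenAI CLIP: pip install git+https://github.com/openai/CLIP.git

device = "cuda" if torch.cuda.is_available() else "cpu"
model, _ = clip.load("ViT-B/32", device=device)  # backbone choice is an assumption

# Hypothetical semantic classes present in a scene; the paper's class source may differ.
class_names = ["dog", "frisbee", "grass"]
tokens = clip.tokenize([f"a photo of a {c}" for c in class_names]).to(device)

with torch.no_grad():
    text_embeds = model.encode_text(tokens)  # (num_classes, 512) semantic embeddings
# L2-normalize so the embeddings live on the unit hypersphere
text_embeds = text_embeds / text_embeds.norm(dim=-1, keepdim=True)
```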
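The MCM's relation modeling between co-salient object prototypes and the multi-class semantic embeddings can be pictured as a standard cross-attention block; the embedding width, head count, and residual fusion below are assumptions for illustration.

```python
import torch
import torch.nn as nn

class MixedContentModulator(nn.Module):
    """Sketch of prototype-to-semantic cross-attention (illustrative only)."""

    def __init__(self, dim: int = 512, num_heads: int = 8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, prototypes: torch.Tensor, clip_embeds: torch.Tensor) -> torch.Tensor:
        # prototypes: (B, K, dim) co-salient object prototypes (queries)
        # clip_embeds: (B, M, dim) CLIP multi-class semantic embeddings (keys/values)
        ctx, _ = self.attn(prototypes, clip_embeds, clip_embeds)
        # Residual fusion injects the scene-structure priors into the prototypes
        return self.norm(prototypes + ctx)
```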
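The MSM's statistic synthesis follows the spirit of MixStyle-style augmentation: channel-wise means and standard deviations of training features are mixed across samples to simulate novel styles. A minimal sketch, assuming 4-D PyTorch feature maps; the Beta-distributed mixing coefficient is an assumption, not necessarily the paper's exact formulation.

```python
import torch
import torch.nn as nn

class MixedStyleModulator(nn.Module):
    """Sketch of style synthesis via feature-statistic mixing (MixStyle-like)."""

    def __init__(self, alpha: float = 0.1, eps: float = 1e-6):
        super().__init__()
        self.beta = torch.distributions.Beta(alpha, alpha)
        self.eps = eps

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (B, C, H, W); styles are only perturbed during training
        if not self.training:
            return x
        B = x.size(0)
        mu = x.mean(dim=(2, 3), keepdim=True)                    # per-channel mean
        sig = (x.var(dim=(2, 3), keepdim=True) + self.eps).sqrt()  # per-channel std
        x_norm = (x - mu) / sig                                  # strip the original style
        perm = torch.randperm(B, device=x.device)                # borrow styles from peers
        lam = self.beta.sample((B, 1, 1, 1)).to(x.device)
        mu_mix = lam * mu + (1 - lam) * mu[perm]                 # synthesized mean
        sig_mix = lam * sig + (1 - lam) * sig[perm]              # synthesized std
        return x_norm * sig_mix + mu_mix                         # re-style the content
```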
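A uniformity objective on the hypersphere is commonly written as the log of the mean pairwise Gaussian potential over L2-normalized embeddings (Wang & Isola, 2020); whether the SCM uses exactly this form or a weighted variant is an assumption here.

```python
import torch
import torch.nn.functional as F

def uniformity_loss(prototypes: torch.Tensor, t: float = 2.0) -> torch.Tensor:
    # prototypes: (N, D) learned co-salient prototypes (shape is an assumption)
    z = F.normalize(prototypes, dim=-1)     # project onto the unit hypersphere
    sq_dists = torch.pdist(z, p=2).pow(2)   # pairwise squared Euclidean distances
    # log E[exp(-t * ||z_i - z_j||^2)]: minimized when points spread uniformly
    return sq_dists.mul(-t).exp().mean().log()
```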