Mixture-of-Experts based Feature Decoupling for Open Vocabulary Scene Graph Generation
Abstract
In recent years, while Scene Graph Generation (SGG) has advanced significantly, mainstream methods remain constrained by pre-defined object and relationship categories, which limits their generalization to open real-world scenarios. Inspired by open vocabulary object detection, recent efforts have extended SGG to the open vocabulary domain. However, these models often rely on off-the-shelf vision-language models (VLMs), which lack discriminative attribute extraction and offer only limited semantic interaction between objects and relationships, leading to misclassification of unseen categories. To address these issues, we propose the MoE Feature Decoupling (MoE-FD) framework for Open Vocabulary Scene Graph Generation. MoE-FD adaptively learns feature decoupling for objects and relationships via multiple experts, prioritizing critical features through gating network weights. Moreover, it models semantic interactions between objects and relationships using iterative cross-attention, enhancing relationship triple associations and visual-semantic alignment. The main contributions of MoE-FD are threefold: (1) A MoE-based feature decoupling framework that adaptively enhances discriminative feature representation for objects and relations. (2) Semantic interaction modeling between objects and relations to strengthen relationship triple associations and image-text alignment accuracy. (3) Extensive experiments demonstrate the effectiveness of MoE-FD on the Visual Genome dataset.
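The two mechanisms named above, gated mixing of expert outputs and iterative cross-attention between object and relationship features, can be illustrated with a minimal NumPy sketch. This is not the MoE-FD implementation: the function names (`moe_decouple`, `cross_attend`), the linear experts, the dense softmax gate, the residual update, the feature dimensions, and the number of iterations are all assumptions chosen for illustration, and the weights are random rather than learned.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def moe_decouple(feats, expert_weights, gate_weights):
    """Mix the outputs of k linear experts with per-sample gate weights.

    feats: (n, d), expert_weights: (k, d, d), gate_weights: (d, k).
    Hypothetical helper; linear experts and a dense gate are assumptions.
    """
    gates = softmax(feats @ gate_weights, axis=-1)                # (n, k)
    expert_out = np.einsum("nd,kde->nke", feats, expert_weights)  # (n, k, d)
    mixed = np.einsum("nk,nke->ne", gates, expert_out)            # (n, d)
    return mixed, gates

def cross_attend(queries, context):
    """One scaled dot-product cross-attention step with a residual update."""
    d = queries.shape[-1]
    attn = softmax(queries @ context.T / np.sqrt(d), axis=-1)     # (nq, nc)
    return queries + attn @ context

rng = np.random.default_rng(0)
d, k = 16, 4                         # feature size and expert count (assumed)
obj = rng.normal(size=(5, d))        # 5 object features
rel = rng.normal(size=(3, d))        # 3 relationship features

obj_f, obj_g = moe_decouple(obj, 0.1 * rng.normal(size=(k, d, d)),
                            rng.normal(size=(d, k)))
rel_f, rel_g = moe_decouple(rel, 0.1 * rng.normal(size=(k, d, d)),
                            rng.normal(size=(d, k)))

# Alternate the direction of attention for a fixed number of rounds (assumed: 2).
for _ in range(2):
    rel_f = cross_attend(rel_f, obj_f)
    obj_f = cross_attend(obj_f, rel_f)
```

In this sketch each row of `obj_g` and `rel_g` sums to one, so the gate can be read as a per-sample weighting over experts; a learned model would train the expert and gate weights jointly rather than sampling them.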