
LQMFormer: Language-aware Query Mask Transformer for Referring Image Segmentation

Nisarg Shah · Vibashan VS · Vishal M. Patel

Arch 4A-E Poster #316
Thu 20 Jun 10:30 a.m. PDT — noon PDT


Referring Image Segmentation (RIS) aims to segment an object described by a language expression from an image. Recently, state-of-the-art Transformer-based methods have been proposed to efficiently leverage cross-modal dependencies, enhancing performance for referring segmentation. Specifically, these Transformer-based methods predict masks in which each query learns a different object. However, because the prediction is a single mask, this leads to query collapse, where all queries produce the same mask prediction. To address these limitations, we propose a Multi-modal Query Feature Fusion technique with two key designs: (1) Gaussian-enhanced Multi-modal Fusion, a novel visual grounding mechanism for extracting rich local visual information and modeling global visual-linguistic relationships in an integrated manner; and (2) a Language-Query Selection Module for generating a diverse set of queries, together with a scoring network that selectively updates only the queries expected to be referenced by the decoder. In addition, we show that adding an auxiliary loss that increases the distance between the mask representations of queries further improves performance. Extensive experiments on four benchmark datasets demonstrate the effectiveness of our framework.
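The auxiliary loss mentioned in the abstract could, for instance, penalize pairwise similarity between query mask representations so that queries are pushed apart. The sketch below is a hypothetical, minimal illustration of such a diversity term (the paper's exact formulation may differ); the function name, array shapes, and the choice of cosine similarity are all assumptions, shown here in NumPy for clarity.

```python
import numpy as np

def query_diversity_loss(query_embs):
    """Illustrative diversity loss over query mask representations.

    query_embs: (N, D) array, one row per query (hypothetical layout).
    Returns the mean off-diagonal cosine similarity; minimizing this
    pushes the N query representations away from one another.
    """
    # L2-normalize each query vector.
    normed = query_embs / np.linalg.norm(query_embs, axis=1, keepdims=True)
    # Pairwise cosine similarity matrix, shape (N, N).
    sim = normed @ normed.T
    n = sim.shape[0]
    # Exclude the diagonal (self-similarity is always 1).
    off_diag = sim[~np.eye(n, dtype=bool)]
    return off_diag.mean()
```

Identical queries give a loss of 1 (fully collapsed), while mutually orthogonal queries give a loss of 0, so minimizing this term discourages the query-collapse failure mode described above.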
