

Poster

LQMFormer: Language-aware Query Mask Transformer for Referring Image Segmentation

Nisarg Shah · Vibashan VS · Vishal M. Patel


Abstract:

Referring Image Segmentation (RIS) aims to segment an object described by a language expression from an image. Recently, state-of-the-art Transformer-based methods have been proposed to efficiently leverage cross-modal dependencies, enhancing performance for referring segmentation. Specifically, these Transformer-based methods predict masks through a set of queries, where each query learns a different object. However, because only a single mask is predicted, this leads to query collapse, where all queries converge to the same mask prediction. To address these limitations, we propose a Multi-modal Query Feature Fusion technique with two key designs: (1) a Gaussian-enhanced Multi-modal Fusion module, a novel visual grounding mechanism that extracts rich local visual information and models global visual-linguistic relationships in an integrated manner; and (2) a Language-Query Selection Module that generates a diverse set of queries, together with a scoring network that selectively updates only the queries expected to be referenced by the decoder. In addition, we show that adding an auxiliary loss that increases the distance between the mask representations of queries helps improve performance. Extensive experiments on four benchmark datasets demonstrate the effectiveness of our framework.
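The abstract does not give the exact form of the auxiliary loss, but the idea of pushing query mask representations apart can be sketched as a simple pairwise hinge on their distances. The function name, the margin parameter, and the hinge formulation below are illustrative assumptions, not the paper's actual loss:

```python
import math

def query_diversity_loss(query_embs, margin=1.0):
    """Hypothetical auxiliary loss sketch: penalize pairs of query mask
    representations whose Euclidean distance is below `margin`, so that
    minimizing it pushes the representations apart. The paper's actual
    formulation is not specified in the abstract."""
    n = len(query_embs)
    total, pairs = 0.0, 0
    for i in range(n):
        for j in range(i + 1, n):
            d = math.dist(query_embs[i], query_embs[j])  # pairwise distance
            total += max(0.0, margin - d)                # hinge: only close pairs penalized
            pairs += 1
    return total / pairs if pairs else 0.0
```

With this form, two identical query representations incur the full margin penalty, while well-separated ones contribute nothing, which matches the stated goal of discouraging query collapse.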
