Unlocking the Potential of Pre-trained Vision Transformers for Few-Shot Semantic Segmentation through Relationship Descriptors

Ziqin Zhou · Hai-Ming Xu · Yangyang Shu · Lingqiao Liu

Arch 4A-E Poster #353
Wed 19 Jun 10:30 a.m. PDT — noon PDT

Abstract: The recent advent of pre-trained vision transformers has unveiled a promising property: their inherent capability to group semantically related visual concepts. In this paper, we explore harnessing this emergent feature to tackle few-shot semantic segmentation, a task focused on classifying the pixels of a test image given only a few labeled examples. A critical hurdle in this endeavor is preventing overfitting to the limited classes seen while training the few-shot segmentation model. As our main discovery, we find that the concept of "relationship descriptors", initially conceived for enhancing the CLIP model for zero-shot semantic segmentation, offers a potential solution. We adapt and refine this concept to craft a relationship descriptor construction tailored for few-shot semantic segmentation, extending its application across multiple layers to enhance performance. Building upon this adaptation, we propose a few-shot semantic segmentation framework that is not only easy to implement and train but also scales effectively with the number of support examples and categories. Through rigorous experimentation on various datasets, including PASCAL-$5^{i}$ and COCO-$20^{i}$, we demonstrate a clear advantage of our method in diverse few-shot semantic segmentation scenarios and across a range of pre-trained vision transformer models. The findings show that our method significantly outperforms current state-of-the-art techniques, highlighting the effectiveness of harnessing the emerging capabilities of vision transformers for few-shot semantic segmentation.
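To make the "relationship descriptor" idea concrete, the sketch below illustrates one common construction from the zero-shot segmentation literature: each patch feature is paired with its element-wise product against a class embedding, and descriptors from several transformer layers are concatenated. This is an illustrative assumption about the general technique, not the paper's exact formulation; the function names and shapes are hypothetical.

```python
import numpy as np

def relationship_descriptors(patch_feats, cls_embed):
    """Per-patch relationship descriptors: concatenate the element-wise
    product of each patch feature with a class embedding (the 'relation'
    term) to the raw patch feature. Shapes: (N, D) x (D,) -> (N, 2D).
    Illustrative construction; the paper's variant may differ."""
    relation = patch_feats * cls_embed[None, :]            # (N, D)
    return np.concatenate([relation, patch_feats], axis=-1)  # (N, 2D)

def multilayer_descriptors(per_layer_feats, cls_embed):
    """Extend descriptors across multiple transformer layers by
    concatenating the per-layer descriptors along the feature axis."""
    return np.concatenate(
        [relationship_descriptors(f, cls_embed) for f in per_layer_feats],
        axis=-1,
    )

# Toy usage: 4 patches, 8-dim features, descriptors from 2 layers.
rng = np.random.default_rng(0)
feats = [rng.standard_normal((4, 8)) for _ in range(2)]
cls = rng.standard_normal(8)
desc = multilayer_descriptors(feats, cls)
print(desc.shape)  # (4, 32): 2 layers x 2D descriptors per patch
```

Descriptors like these can then be compared against support-set prototypes to score each pixel, which is what lets the framework scale with the number of support examples and categories.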
